Go Optimizations 101 (2024)
I also would like to thank all gophers who ever made influences on this
book, be it directly or indirectly, intentionally or unintentionally.
Thanks to the authors of the following open source software and libraries
used in building this book:
golang, https://go.dev/
gomarkdown, https://github.com/gomarkdown/markdown
goini, https://github.com/zieckey/goini
go-epub, https://github.com/bmaupin/go-epub
pandoc, https://pandoc.org
calibre, https://calibre-ebook.com/
GIMP, https://www.gimp.org
Thanks to the gophers who reported mistakes in this book or made
corrections for it: yingzewen, ivanburak, cortes-, skeeto@reddit,
Yang Yang, DashJay, Stephan, etc.
About Go Optimizations 101
This book offers practical tricks, tips, and suggestions to optimize Go code
performance. Its insights are grounded in the official Go compiler and
runtime implementation.
Unless otherwise indicated, the code examples provided in this book are
tested and run on a notebook with the following environment setup:
Some of the suggestions discussed in this book work on any platform and
for any CPU model, but others might only work on specific platforms and
for specific CPU models. So please benchmark them in the same
environment as your production environment before adopting any of them.
Some implementation details of the official standard Go compiler and
runtime might change from version to version, which means some of the
suggestions discussed here might not work for future Go toolchain
versions.
The book will be open sourced eventually, chapter by chapter.
Tapir was once (and may be again) an indie game developer. You can find
his games here: tapirgames.com.
About GoTV
While writing this book, the tool GoTV was used to manage installations of
multiple Go toolchain versions and to check behavior differences between
Go toolchain versions.
Feedback
You are welcome to improve this book by submitting corrections to the Go
101 issue list (https://github.com/go101/go101) for all kinds of mistakes,
such as typos, grammar errors, wording inaccuracies, wrong explanations,
description flaws, code bugs, etc.
If a value always contains only one part, then that part may also be called
the direct part of the value, and we say the value has no indirect parts. A
value of the following kinds of types always contains only one (direct) part:
boolean types
numeric types (int8, uint8, int16, uint16, int32, uint32, int64, uint64,
int, uint, uintptr, float32, float64, complex64, complex128)
pointer types
unsafe pointer types
struct types
array types
And a value of the following kinds of types may contain one or more
indirect parts:
slice types
map types
channel types
function types
interface types
string types
When assigning/copying a value, only the direct part of the value is copied.
After copying, the direct parts of the destination and source values both
reference the indirect parts of the source value (if any indirect parts exist).
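For example, the following tiny program demonstrates this sharing for slices:

```go
package main

import "fmt"

func main() {
	a := []int{1, 2, 3}
	b := a     // copies only the direct part (pointer, length and capacity)
	b[0] = 100 // modifies an element on the shared indirect part
	fmt.Println(a[0]) // 100 (both slices reference the same elements)
}
```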
At run time, each value part is carried on one memory block (memory
blocks will be explained in a following chapter). So, if a value contains two
parts, the value is quite possibly distributed across two memory blocks.
(Note: The terminology value part was invented by the Go 101 series
books. It is not widely used in the Go community. Personally, I think the
terminology makes some explanations more convenient.)
Value/type sizes
The size of a value part means how many bytes need to be allocated in
memory to store the value part at run time.
The size of a value exactly equals the size of the direct part of the value. In
other words, the indirect parts of a value don't contribute to the size of the
value. The reason has been mentioned above: when assigning/copying a
value, only the direct part of the value is copied, and the indirect parts
might be shared by multiple values, so it is not a good idea to let shared
indirect parts contribute to value sizes.
A struct value holds all its fields. In other words, a struct value is
composed of all its fields. At run time, the fields of a struct are
allocated on the same memory block as the struct itself. Copying a
struct value means copying all the fields of the struct value. So all the
fields of a struct value contribute to the size of the struct value.
Like struct values, an array value holds all its elements. In other words,
an array is composed of all its elements. At run time, the elements of an
array are allocated on the same memory block as the array itself.
Copying an array value means copying all the elements of the array
value. So all the elements of an array contribute to the size of the array
value.
A pointer doesn't hold the value being referenced (pointed to) by it.
So the value referenced by a pointer doesn't contribute to the size of
the pointer value (hence nil pointers and non-nil pointers have the same
size). The two values are often allocated on two different memory
blocks, so copying one of them will not copy the other.
Internally, a slice uses a pointer (on the direct part) to reference all its
elements (on the indirect part). The length and capacity information (two
int values) of a slice is stored on the direct part of the slice. From the
memory allocation point of view, a slice doesn't hold its elements; they are
allocated on another (indirect) value part other than its direct part. When
assigning a slice value to another slice value, no elements of the slice are
copied. After the assignment, the source slice and the destination slice both
reference (but don't hold) the same elements. So the elements of a slice
don't contribute to the size of that slice. This is why the sizes of all slice
types are the same.
Like slice values, a map value just references all its entries and a buffered
channel value just references its elements being buffered.
a string just references all its contained bytes (on the indirect part),
though logically we can also think of a string as holding all its
bytes. The length information of a string is stored on the direct part of
the string as an int value.
an interface value just references its dynamic value, though logically
we can also think of an interface value as holding its dynamic value.
Detailed type sizes
The following table lists the sizes (used in the official standard Go
compiler) of all the 26 kinds of types in Go. In the table, one word means
one native word (4 bytes on 32-bit architectures and 8 bytes on 64-bit
architectures).
Memory alignments
To fully utilize CPU instructions and get the best performance, the (start)
addresses of the memory blocks allocated for (the direct parts of) values of
a specified type must be aligned as multiples of an integer N. Here N is
called the alignment guarantee of that type.
The following table lists the alignment guarantees made by the official
standard Go compiler for all kinds of types. Again, one native word is 4
bytes on 32-bit architectures and 8 bytes on 64-bit architectures.
Struct padding
To satisfy the type alignment guarantee rules mentioned previously, Go
compilers may pad some bytes after certain fields of struct values. The
padded bytes are counted in struct sizes. So the size of a struct type may
not be a simple sum of the sizes of all its fields.
For example, the size of the struct type shown in the following code is 24
on 64-bit architectures.
type T1 struct {
	a int8
	// 7 bytes are padded here.
	b int64
	c int16
	// 6 bytes are padded here.
}
The alignment guarantee of the struct type is the same as the largest
alignment guarantee among its field types, which here is the alignment
guarantee (8, one native word) of type int64 . This means the distance
between the addresses of the fields b and a of a value of the struct type
is a multiple of 8. Clever compilers should choose the minimum
possible value: 8. To get the desired alignment, 7 bytes are padded after
the field a .
The size of the struct type must be a multiple of the alignment
guarantee of the struct type. So, considering the existence of the field
c , the minimum possible size is 24 (8x3), which should be used by
clever compilers. To get the desired size, 6 bytes are padded after the
field c .
Field orders matter in struct type size calculations. If we swap the fields
b and c of the above struct type, then the size of the struct will
become 16.
type T2 struct {
	a int8
	// 1 byte is padded here.
	c int16
	// 4 bytes are padded here.
	b int64
}
package main

import "unsafe"

type T1 struct {
	a int8
	b int64
	c int16
}

type T2 struct {
	a int8
	c int16
	b int64
}

func main() {
	// The printed values are for
	// 64-bit architectures.
	println(unsafe.Sizeof(T1{})) // 24
	println(unsafe.Sizeof(T2{})) // 16
}
But what are small-size types? There is no formal definition. In fact, the
definition depends on specific CPU architectures and compiler
implementations. In the official standard Go compiler implementation,
except for large-size struct and array types, all other types in Go can be
viewed as small-size types.
What are small-size struct and array values? There is also not a formal
definition. The official standard Go compiler tweaks some implementation
details from version to version. However, in practice, we can view struct
types with no more than 4 native-word-size fields and array types with no
more than 4 native-word-size elements as small-size values, such as
struct{a, b, c, d int} , struct{element *T; len int; cap int} and
[4]uint .
For the official standard Go compiler 1.22 versions, a copy cost leap
happens between copying 9-element arrays and copying 10-element arrays
(the element size is one native word). The same applies to copying 9-field
structs and copying 10-field structs (each field size is one native word).
The proof:
package copycost
import "testing"
const N = 1024
type Element = uint64
These results indicate that copying arrays with fewer than 10 elements and
structs with fewer than 10 fields might be specially optimized.
The official standard Go compiler might use different criteria for other
scenarios to determine what are small struct and array types. For example,
in the following benchmark code, the Add4 function consumes much less
CPU resources than the Add5 function (with the official standard Go
compiler 1.22 versions).
package copycost

import "testing"

// The two struct types are assumed from the description above:
// 4 and 5 native-word-size fields respectively.
type T4 struct{ a, b, c, d int }
type T5 struct{ a, b, c, d, e int }

//go:noinline
func Add4(x, y T4) (z T4) {
	z.a = x.a + y.a
	z.b = x.b + y.b
	z.c = x.c + y.c
	z.d = x.d + y.d
	return
}

//go:noinline
func Add5(x, y T5) (z T5) {
	z.a = x.a + y.a
	z.b = x.b + y.b
	z.c = x.c + y.c
	z.d = x.d + y.d
	z.e = x.e + y.e
	return
}
The //go:noinline compiler directives used here prevent the calls to the
two functions from being inlined. If the directives are removed, the
Add4 function will become even more performant.
The following are several examples which show the costs of copying some
large-size values.
Example 1:
package copycost

import "testing"

const N = 1024

//go:noinline
func Sum_RangeArray(a [N]int) (r int) {
	for _, v := range a {
		r += v
	}
	return
}

//go:noinline
func Sum_RangeArrayPtr1(a *[N]int) (r int) {
	for _, v := range *a {
		r += v
	}
	return
}

//go:noinline
func Sum_RangeArrayPtr2(a *[N]int) (r int) {
	for _, v := range a {
		r += v
	}
	return
}

//go:noinline
func Sum_RangeSlice(a []int) (r int) {
	for _, v := range a {
		r += v
	}
	return
}
//===========
var r [128]int
Example 2:
package copycost

import "testing"

// The Element type definition is omitted in this excerpt. Judging from
// the indexing below, it is presumably an array type, for example:
type Element [10]int64

//go:noinline
func Sum_PlainForLoop(s []Element) (r int64) {
	for i := 0; i < len(s); i++ {
		r += s[i][0]
	}
	return
}

//go:noinline
func Sum_OneIterationVar(s []Element) (r int64) {
	for i := range s {
		r += s[i][0]
	}
	return
}

//go:noinline
func Sum_UseSecondIterationVar(s []Element) (r int64) {
	for _, v := range s {
		r += v[0]
	}
	return
}
//===================
var r [128]int64
Example 3:
package copycost

import "testing"

// The S type is assumed from the description below:
// a struct with 5 native-word-size fields.
type S struct{ a, b, c, d, e int }

//go:noinline
func sum_UseSecondIterationVar(s []S) int {
	var sum int
	for _, v := range s {
		sum += v.c
		sum += v.d
		sum += v.e
	}
	return sum
}

//go:noinline
func sum_OneIterationVar_Index(s []S) int {
	var sum int
	for i := range s {
		sum += s[i].c
		sum += s[i].d
		sum += s[i].e
	}
	return sum
}

//go:noinline
func sum_OneIterationVar_Ptr(s []S) int {
	var sum int
	for i := range s {
		v := &s[i]
		sum += v.c
		sum += v.d
		sum += v.e
	}
	return sum
}
Please note that, if we reduce the field count from 5 to 4 in the above
example, then there are no performance differences between the three
functions. The reason has been mentioned above: a struct type with no
more than 4 fields of native word size is treated as a small-size type.
Example 4:
// demo-largesize-loop-var.go
package main

import (
	"fmt"
	"time"
)

// The following two declarations are assumed; the originals are omitted.
type Large [1 << 13]int // a large-size array type

//go:noinline
func readOnly(a *Large, i int) {}

func foo() {
	for a, i := (Large{}), 0; i < len(a); i++ {
		readOnly(&a, i)
	}
}

func main() {
	bench := func() time.Duration {
		start := time.Now()
		foo()
		return time.Since(start)
	}
	fmt.Println("elapsed time:", bench())
}
From the outputs, we can find that the semantic change made in Go 1.22
(loop variables are per-iteration, so the large-size loop variable a gets
copied at each iteration) causes a significant performance regression for
the above code. So, when using Go toolchain 1.22+ versions, try not to
declare large-size values as loop variables.
Memory Allocations
Memory blocks
The basic memory allocation units are called memory blocks. A memory
block is a contiguous memory segment. As aforementioned, at run time, a
value part is carried on a single memory block.
A single memory block might carry multiple value parts. The size of a
memory block must not be smaller than the size of any value part it carries.
When a memory block is carrying a value part, we may say the value part is
referencing the memory block.
Collecting stack memory blocks is also much cheaper than collecting heap
memory blocks. In fact, stack memory blocks don't need to be collected.
The stack of a goroutine could be actually viewed as a single memory
block, and it will be collected as a whole when the goroutine exits.
On the other hand, when all the value parts carried on a heap memory
block are no longer used (in other words, no alive value part is still
referencing the memory block), the memory block will be viewed as
garbage and automatically collected eventually, during runtime garbage
collection cycles, which might consume certain CPU resources (garbage
collection will be discussed in detail in a later chapter). Generally, the
more memory blocks are allocated on heap, the more pressure is put on
garbage collection.
The escape analysis module of a Go compiler can detect that some value
parts will only be used by one goroutine, and it tries to let those value parts
be allocated on stack at run time if certain extra conditions are satisfied.
Stack memory allocations and escape analysis will be explained in more
detail in the next chapter.
Memory allocation scenarios
Generally, each of the following operations will make at least one
allocation.
declare variables
call the built-in new function.
call the built-in make function.
modify slices and maps with composite literals.
convert integers to strings.
concatenate strings by using + .
convert strings to byte slices, and vice versa.
convert strings to rune slices.
box values into interfaces (converting non-interface values into
interfaces).
append elements to a slice and the capacity of the slice is not large
enough.
put new entries into maps and the underlying array (to store entries) of
the map is not large enough to store the new entries.
some memory block size classes (no more than 32768 bytes) are
predefined. As of the official standard Go compiler version 1.22.x, the
smallest size classes are 8, 16, 24, 32, 48, 64, 80 and 96 bytes.
For memory blocks larger than 32768 bytes, each of them is always
composed of multiple memory pages. The memory page size used by
the official standard Go runtime (1.22 versions) is 8192 bytes.
So,
to allocate a (heap) memory block for a value whose size is in the
range [33, 48] , the size of the memory block is generally (must be at
least) 48. In other words, up to 15 bytes might be wasted (if the
value size is 33).
to create a byte slice with 32769 elements on heap, the size of the
memory block carrying the elements of the slice is 40960 (32768 +
8192, 5 memory pages). In other words, 8191 bytes are wasted.
In other words, memory blocks are often larger than needed. These
strategies are designed to manage memory easily and efficiently, but they
might sometimes waste a bit of memory (yes, a trade-off).
package main

import "testing"
import "unsafe"

var t *[5]int64
var s []byte

// f and g are assumed from the outputs below; the originals are omitted.
func f(b *testing.B) {
	for i := 0; i < b.N; i++ { t = &[5]int64{} }
}
func g(b *testing.B) {
	for i := 0; i < b.N; i++ { s = make([]byte, 32769) }
}

func main() {
	println(unsafe.Sizeof(*t)) // 40
	rf := testing.Benchmark(f)
	println(rf.AllocedBytesPerOp()) // 48
	rg := testing.Benchmark(g)
	println(rg.AllocedBytesPerOp()) // 40960
}
Another example:
package main

import "testing"

func main() {
	br := testing.Benchmark(Concat)
	println(br.AllocsPerOp())       // 3
	println(br.AllocedBytesPerOp()) // 176
}
There are 3 allocations made within the Concat function. Two of them are
caused by the byte-slice-to-string conversions string(s) , and the sizes of
the two memory blocks carrying the underlying bytes of the two result
strings are both 48 (the smallest size class not smaller than 33). The third
allocation is caused by the string concatenation, and the size of the result
memory block is 80 (the smallest size class not smaller than 66). The three
allocations allocate 176 (48+48+80) bytes in total, of which 44 (15 + 15 +
14) bytes are wasted during the execution of the Concat function.
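The Concat function itself is omitted in this excerpt. Based on the description above, it presumably looks like the following sketch (the variable names here are assumptions):

```go
package main

import "testing"

var s = make([]byte, 33) // each string(s) result needs a 48-byte size class
var r string             // a sink, to prevent the work being optimized away

// Concat performs two byte-slice-to-string conversions plus one
// string concatenation per iteration: 3 allocations in total.
func Concat(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r = string(s) + string(s)
	}
}

func main() {
	br := testing.Benchmark(Concat)
	println(br.AllocsPerOp())       // 3
	println(br.AllocedBytesPerOp()) // 176
}
```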
In the above example, the results of the string(s) conversions are used
temporarily in the string concatenation operation. In the current official
standard Go compiler/runtime implementation (1.22 versions), the string
bytes are allocated on heap (see the sections below for details). After the
concatenation is done, the memory blocks carrying the string bytes become
memory garbage and will be collected eventually.
Memory is cheap nowadays, but this is not true for the memory sold by
cloud computing providers. So if we run programs on cloud servers, the
more memory is saved by the Go programs, the more money is saved.
The following are some suggestions to reduce memory allocations and save
memory in programming.
For example, to merge some slices into one, the following shown
MergeWithTwoLoops implementation is more efficient than the
MergeWithOneLoop implementation, because the former makes fewer
allocations and copies fewer values.
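The two implementations are omitted in this excerpt; based on the description, they presumably look like the following sketch:

```go
package main

import "fmt"

// MergeWithOneLoop appends to an initially-nil result, which may
// reallocate and copy the backing array several times as it grows.
func MergeWithOneLoop(ss ...[]int) []int {
	var r []int
	for _, s := range ss {
		r = append(r, s...)
	}
	return r
}

// MergeWithTwoLoops first sums the lengths, then makes exactly one
// allocation with the needed capacity before copying the elements.
func MergeWithTwoLoops(ss ...[]int) []int {
	n := 0
	for _, s := range ss {
		n += len(s)
	}
	r := make([]int, 0, n)
	for _, s := range ss {
		r = append(r, s...)
	}
	return r
}

func main() {
	fmt.Println(MergeWithTwoLoops([]int{1, 2}, []int{3, 4}, []int{5})) // [1 2 3 4 5]
}
```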
package allocations
import "testing"
package main
import "fmt"
From the outputs, we can see that only the last append call doesn't
allocate.
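The example referred to here is omitted in this excerpt. The following small sketch (not the original code) shows how to observe which append calls allocate, by watching for capacity changes:

```go
package main

import "fmt"

func main() {
	s := make([]int, 4, 5)
	for i := 0; i < 3; i++ {
		oldCap := cap(s)
		s = append(s, i)
		// A capacity change implies this append call allocated a new
		// backing array (and copied the old elements over).
		fmt.Println("len:", len(s), "cap:", cap(s), "allocated:", cap(s) != oldCap)
	}
}
```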
package allocations
import "testing"
package allocations

import "testing"

const N = 100

// The Book type is assumed; its original definition is omitted.
type Book struct{ pages int }

//go:noinline
func CreateBooksOnOneLargeBlock(n int) []*Book {
	books := make([]Book, n)
	pbooks := make([]*Book, n)
	for i := range pbooks {
		pbooks[i] = &books[i]
	}
	return pbooks
}

//go:noinline
func CreateBooksOnManySmallBlocks(n int) []*Book {
	books := make([]*Book, n)
	for i := range books {
		books[i] = new(Book)
	}
	return books
}
From the results, we can see that allocating many small value parts on one
large memory block is more performant than allocating each of them on a
separate small memory block.
For example, there are many non-player characters (NPC) in RTS games. A
large quantity of NPCs will be spawned and destroyed from time to time in
a game session. The related code is like
We could instead use a cache pool to reduce allocations, like the code
shown below.
import "container/list"
We could also set a max size for the pool, to prevent the pool from
occupying too much memory.
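For example, here is a minimal sketch of such a pool, with a hypothetical NPC type, using container/list as the cache:

```go
package main

import "container/list"

type NPC struct{ x, y int } // a hypothetical game object type

// pool caches destroyed NPC objects so that spawning often reuses an
// old memory block instead of making a fresh heap allocation.
var pool = list.New()

const maxPoolSize = 1024 // cap the pool so it can't occupy too much memory

func spawnNPC() *NPC {
	if e := pool.Front(); e != nil {
		pool.Remove(e)
		npc := e.Value.(*NPC)
		*npc = NPC{} // reset the reused object to its zero value
		return npc
	}
	return &NPC{}
}

func destroyNPC(npc *NPC) {
	if pool.Len() < maxPoolSize {
		pool.PushBack(npc)
	}
}

func main() {
	a := spawnNPC()
	destroyNPC(a)
	b := spawnNPC()
	println(a == b) // true: the cached object is reused
}
```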
Escape analysis
Not all value parts are capable of being allocated on stack. One principal
condition for allocating a value part on stack is that the value part is only
used in one goroutine (the current one) during its lifetime. Generally, if the
compiler detects that a value part is used by more than one goroutine, or it
is unable to make sure that the value part is used by only one goroutine,
then it lets the value part be allocated on heap. We also say the value part
escapes (to heap).
Please note that the fact that a value part could be allocated on stack
doesn't mean the value part will be allocated on stack for sure at run time.
If the size of the value part is too large, then the compiler will still let the
value part be allocated on heap anyway. The size thresholds used in the
official standard Go compiler will be introduced in later sections of this
chapter.
The basic escape analysis units are functions. Only local values are escape
analyzed. All package-level variables are allocated on heap for sure.
The -m compiler option is used to show escape analysis results. Let's view
an example.
// escape.go
package main
func main() {
	var (
		a = 1 // moved to heap: a
		b = false
		c = make(chan struct{})
	)
	go func() {
		if b {
			a++
		}
		close(c)
	}()
	<-c
	println(a, b) // 1 false
}
Why is the variable b allocated on stack while the variable a escapes?
Aren't they both used in two goroutines? The reason is that the escape
analysis module is smart enough to detect that the variable b is never
modified, so it thinks it is a good idea to use a (hidden implicit) copy of
the variable b in the new goroutine.
Let's add one new line b = !b before the print line and run it again.
// escape.go
package main
func main() {
	var (
		a = 1     // moved to heap: a
		b = false // moved to heap: b
		c = make(chan struct{})
	)
	go func() {
		if b {
			a++
		}
		close(c)
	}()
	<-c
	b = !b
	println(a, b) // 1 true
}
The outputs:
Now both the variable a and the variable b escape. In fact, for this
specific example, the compiler could still use a copy of variable b in the
new goroutine. But it is too expensive to let the escape analysis module
analyze the concurrency synchronizations used in code.
For a similar reason, the escape analysis module also doesn't try to check
whether or not the variable a will really be modified. If we change b to a
constant, then the variable a will be allocated on stack, because the line
a++ will be optimized away.
We may use the -S compiler option to show function stack frame sizes.
Let's use the following program as an example:
// frame.go
package main

import (
	"fmt"
	"math/rand"
)

// The foo function is assumed; its original definition is omitted.
//go:noinline
func foo(x int) int {
	return x + rand.Intn(789)
}

func main() {
	x := foo(123)
	fmt.Println(x)
	duck()
}

var v interface{}

//go:noinline
func duck() {
	if v != nil {
		v = [16000]byte{}
		panic("unreachable")
	}
}
Run it with the -S compiler option, we will get the following outputs (some
texts are omitted):
For some goroutines, as function calls go deeper and deeper, more and more
function call stack frames are needed, and the current stack size might
become insufficient. So stacks grow on demand during the execution of a
goroutine. On the contrary, when the function call depth becomes
shallower, the stack of a goroutine might become too large. So stacks
might also shrink. However, to avoid growing and shrinking frequently, a
stack will not shrink immediately as soon as it becomes too large. The
stack of a goroutine will shrink only when it is found to be too large during
a garbage collection cycle and the goroutine isn't doing anything or is
sitting in a system call or a cgo call. Each stack shrinkage halves the old
stack. (Special thanks to Ian Lance Taylor for the explanations.)
package main

import "runtime"

//go:noinline
func f(i int) byte {
	var a [1 << 13]byte // allocated on stack and makes the stack grow
	return a[i]
}

func main() {
	var x int
	println(&x)  // <address 1>
	f(1)         // (make the stack grow)
	println(&x)  // <address 2>
	runtime.GC() // (make the stack shrink)
	println(&x)  // <address 3>
	runtime.GC() // (make the stack shrink)
	println(&x)  // <address 4>
	runtime.GC() // (the stack does not shrink)
	println(&x)  // <address 4>
}
Note that each of the first two manual runtime.GC calls causes a stack
shrinkage, but the last one doesn't.
Let's make an analysis on how the stack of the main goroutine grows and
shrinks during the execution of the program.
The reason why the array a is allocated on stack will be explained in a
later section of this chapter.
The following is an example program which will crash because its stack
exceeds the limit.
package main
func main() {
var x [N]byte
f(x, 50)
}
If the call f(x, 50) is changed to f(x, 48) , then the program will exit
without crashing (on 64-bit systems).
We can call the runtime/debug.SetMaxStack function to change the global
stack maximum limit setting. There is no formal way to control the initial
stack size of a goroutine, though an informal way will be provided in a later
section of this chapter.
For example, the variable n in the following code will escape to heap. The
reason is that there might be many coexisting instances of n if its
containing loop needs many steps to finish. The number of instances is
often hard or impossible to determine at compile time, whereas the stack
frame size of a function must be determined at compile time, so the
compiler lets every instance of n escape to heap.
Admittedly, for this specific example, the compiler could be smarter and
determine that the loop has only one step, so that one and only one
coexisting instance of n is needed. The current official standard Go
compiler (version 1.22.n) doesn't consider such special cases.
package main

func main() {
	var x *int
	for {
		var n = 1 // moved to heap: n
		x = &n
		break
	}
	_ = x
}
For example, in the following code, the value x will be allocated on stack,
but the value y will be allocated on heap.
package main

type I interface {
	M(*int)
}

type T struct{}

func (T) M(*int) {}

var t T
var i I = t

func main() {
	var x int // does not escape
	t.M(&x)
	var y int // moved to heap: y
	i.M(&y)
}
In some cases, the compiler can determine the dynamic value (and
therefore the concrete method) of an interface value at compile time. If the
compiler finds that the concrete method doesn't pass an argument to other
goroutines, then it lets the value parts referenced by the argument not
escape to heap. For example, in the following code, the values x and y are
both allocated on stack. The reason why the value y doesn't escape is that
the method call i.M(&y) is de-virtualized to t.M(&y) at compile time.
package main

type I interface {
	M(*int)
}

type T struct{}

func (T) M(*int) {}

func main() {
	var t T
	var i I = t
	var x int
	t.M(&x)
	var y int
	i.M(&y)
}
// reflect-value-escape-analysis.go
package main

import "reflect"

var x reflect.Value

func main() {
	var n = 100000 // line 9
	_ = reflect.ValueOf(&n)
	var k = 100000
	_ = reflect.ValueOf(k) // line 13
	var q = 100000
	x = reflect.ValueOf(q) // line 16
}
Please note that the message q escapes to heap actually means a copy
of q escapes to heap (similar for k in 1.20 outputs).
Such a call will always make an allocation on heap to create a copy of each
of its arguments (if that argument is not an interface), and it makes the
values referenced by its arguments escape to heap.
For example, in the following code, the variable x will escape to heap, but
the variable y doesn't. And a copy of z is allocated on heap.
package main

import "fmt"

func main() {
	var x = 1 << 20 // moved to heap: x
	fmt.Println(&x)
	var y = 2 << 20 // y does not escape
	println(&y)
	var z = 3 << 20
	fmt.Println(z) // z escapes to heap
}
It is expensive to trace how the return results of a function are used by the
callers of the function. So the compiler lets all value parts referenced by the
results escape.
For example, in the following code, assume the function f is not inline-
able; then the value *p will escape to heap, for it is referenced by the
return result. Note that the compiler is smart enough to decide that the
value t , which is referenced by the argument x passed to the f function
call, can be safely allocated on stack.
package main

//go:noinline
func f(x *int) *int {
	var n = *x + 1 // moved to heap: n
	return &n
}

func main() {
	var t = 1 // does not escape
	var p = f(&t)
	println(*p) // 2
	println(&t) // 0xc000034758
	println(p)  // 0xc0000140c0
}
Still using the example in the last section, but with the //go:noinline
line removed to make the function f inline-able, the value *p will be
allocated on stack.
package main

// f is the same function as in the last section,
// with the //go:noinline directive removed.
func f(x *int) *int {
	var n = *x + 1 // moved to heap: n
	return &n
}

func main() {
	var t = 1
	var p = f(&t)
	println(*p) // 2
	println(&t) // 0xc000034760
	println(p)  // 0xc000034768
}
The two printed addresses show that the distance between the values t and
*p is the size of the value *p , which indicates the two values are both
allocated on stack. Please note that the message "moved to heap: n" is
still reported, but it is for the f function calls which are not inlined (there
are no such calls in this tiny program).
The following is the rewritten code (by the compiler) after inlining:
package main

func main() {
	var t = 1
	var s = &t
	var n = *s + 1
	var p = &n
	println(*p)
	println(&t) // 0xc000034760
	println(p)  // 0xc000034768
}
After the rewrite, the compiler easily knows that the n variable is only
used in the current goroutine, so it lets n not escape.
// constinline.go
package main

// createSlice is assumed to be as follows; its original definition is omitted.
func createSlice(n int) []byte {
	return make([]byte, n)
}

func main() {
	var x = createSlice(32)  // line 9
	var y = make([]byte, 32) // line 10
	_, _ = x, y
}
Run it:
On the other hand, it is generally a good idea to try to let small memory
blocks allocated on stack if it is possible.
The following will introduce some facts and tips to control memory block
allocation places.
In some rare scenarios, we might expect some values, which are only used
in the current goroutine, to be allocated on heap at run time. There are two
ways to make sure a value v escapes to heap at run time:
The standard packages internally use a trick like the following code shows
to make some values escape for sure. The trick uses a package-level
variable sink (which is allocated on heap for sure) to reference the value
which is expected to escape. The line sink = x will be optimized away but
it will still affect the decisions made by the escape analysis module. Please
note that, a value referencing the value expected to escape should be passed
to the escape function. Generally, the passed value is a pointer to the value
expected to escape.
var sink interface{}

//go:noinline
func escape(x interface{}) {
	sink = x
	sink = nil
}
package main

var sink interface{}

//go:noinline
func escape(x interface{}) {
	sink = x
	sink = nil
}

func main() {
	var a = 1    // moved to heap: a
	var b = true // moved to heap: b
	escape(&a)
	escape(&b)
	println(a, b)
}
Surely, if we know some values will be allocated on heap, then we can just
let those values reference the values expected to escape.
package main

func main() {
	var (
		a = 1    // moved to heap: a
		b = true // moved to heap: b
		c = make(chan struct{})
	)
	go func() {
		if b {
			a++
		}
		close(c)
	}()
	<-c
	b = !b
	println(a, b) // 2 false
}
package main

func main() {
	var (
		a = 1    // doesn't escape
		b = true // doesn't escape
		c = make(chan int)
	)
	b1 := b
	go func(a int) {
		if b1 {
			a++
			c <- a
		}
	}(a)
	a = <-c
	b = !b
	println(a, b) // 2 false
}
For this specific example, whether or not the variables a and b escape has
a very small effect on overall program performance. But the tip introduced
here might be helpful elsewhere.
Memory size thresholds used by the compiler to make
allocation placement decisions
The following are some examples to show the effects of these thresholds.
Example 1:
package main
import "testing"
func h() {
	_ = []byte(S) // ([]byte)(S) does not escape
}

func main() {
	stat := func(f func()) int {
		allocs := testing.AllocsPerRun(10, f)
		return int(allocs)
	}
	println(stat(f(bs33)))      // 1 (heap allocation)
	println(stat(f(bs33[:32]))) // 0 (heap allocations)
}
Please note: the "XXX does not escape" analysis message means the
compiler knows that XXX is used by only one goroutine; but if the
corresponding threshold is surpassed, then XXX will still be allocated on
heap (special thanks to Keith Randall for the explanations).
From the outputs of example 1, we can affirm that the byte elements of the
result strings or byte slices of conversions between strings and byte slices
will always be allocated on heap if the length of the result is larger than 32.
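This threshold can be checked with a small self-contained program (a sketch, not the book's original example):

```go
package main

import "testing"

var bs = make([]byte, 33)
var sink byte

func main() {
	stat := func(f func()) int {
		return int(testing.AllocsPerRun(100, f))
	}
	println(stat(func() {
		s := string(bs) // 33 result bytes: always heap allocated
		sink = s[0]
	})) // 1
	println(stat(func() {
		s := string(bs[:32]) // 32 result bytes: may use a stack buffer
		sink = s[0]
	})) // 0
}
```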
Example 2:
package main
import "testing"
func main() {
	stat := func(f func() byte) int {
		allocs := testing.AllocsPerRun(10, func() {
			f()
		})
		return int(allocs)
	}
	println(stat(new65537))    // 1 (heap allocation)
	println(stat(new65535))    // 0 (heap allocations)
	println(stat(comptr65537)) // 1 (heap allocation)
	println(stat(comptr65535)) // 0 (heap allocations)
}
From the outputs of example 2, we can conclude that the T value created by
new(T) or &T{} will always be allocated on heap if the size of type T is
larger than 64K (65536) bytes.
(Please note that we deliberately ignore the case of size 65536. Before Go
toolchain v1.17, T values were allocated on heap if the size of T was exactly
65536 bytes. Go toolchain v1.17 changed this.)
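A minimal self-contained sketch of this threshold (function names are mine; results assume Go toolchains around v1.22):

```go
package main

import "testing"

var r byte

//go:noinline
func new65535() { p := new([65535]byte); r = p[0] } // under 64K: may stay on stack

//go:noinline
func new65537() { p := new([65537]byte); r = p[0] } // over 64K: always heap allocated

func main() {
	small := int(testing.AllocsPerRun(100, new65535))
	big := int(testing.AllocsPerRun(100, new65537))
	println(small, big) // expected: 0 1
}
```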
Example 3:
package main
import "testing"
func main() {
stat := func( f func() bool) int {
allocs := testing.AllocsPerRun(10, func() {
f()
})
return int(allocs)
}
println(stat(makeSlice65537)) // 1 (heap allocation)
println(stat(makeSlice65535)) // 0 (heap allocations)
println(stat(makeSliceVarSize)) // 1 (heap allocation)
}
if N is a constant and the size of the backing array [N]T of the result of
the make([]T, N) call is larger than 64K (64 * 1024) bytes, then the
backing array will be allocated on heap.
if n is not a constant and n is larger than zero, then the backing array
will always be allocated on heap, because the compiler can't determine the
size of the backing array of the result slice at compile time.
(Again, please note that if the constant N equals 65536, the elements of
make([]T, N) were allocated on heap before Go toolchain v1.17, but have
been allocated on stack since Go toolchain v1.17.)
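The elided functions of example 3 might look like the following sketch (my own reconstruction; results assume Go toolchains around v1.22):

```go
package main

import "testing"

var r byte
var n = 100 // a non-constant length

func makeSlice65535() { s := make([]byte, 65535); r = s[0] }  // under 64K: stack
func makeSlice65537() { s := make([]byte, 65537); r = s[0] }  // over 64K: heap
func makeSliceVarSize() { s := make([]byte, n); r = s[0] }    // non-constant length: heap

func main() {
	stat := func(f func()) int { return int(testing.AllocsPerRun(100, f)) }
	println(stat(makeSlice65535), stat(makeSlice65537), stat(makeSliceVarSize)) // expected: 0 1 1
}
```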
Example 4:
package main
import "testing"
func main() {
stat := func( f func() byte) int {
allocs := testing.AllocsPerRun(10, func() {
f()
})
return int(allocs)
}
println(stat(declare10M)) // 0
println(stat(declare10Mplus1)) // 1
println(stat(redeclare10M)) // 0
println(stat(redeclare10Mplus1)) // 1
}
Please note, the above mentioned thresholds 64K and 10M will become
much smaller (16K and 128K, respectively) if the -smallframes compiler
option is specified. For example, if we use the command
go run -gcflags='-smallframes' main.go to run the following program, then
all the checked functions will make one allocation on heap.
// main.go
package main
import "testing"
func main() {
stat := func( f func() byte ) int {
allocs := testing.AllocsPerRun(10, func() {
f()
})
return int(allocs)
}
If we use the command go run main.go to run the program, then none of
the checked functions will make allocations on heap.
We have learned that the largest slice backing array which can be
allocated on stack is 65536 bytes (or 65535 bytes before Go toolchain
v1.17). But there is a tip to raise the limit to 10M: derive slices from a
stack-allocated array. For example, the elements of the slice s created in
the following program are allocated on stack. The length of s is 10M,
which is far larger than 65536.
package main
import "testing"
func main() {
stat := func( f func() byte ) int {
allocs := testing.AllocsPerRun(10, func() {
f()
})
return int(allocs)
}
println(stat(f)) // 0
}
package main
import "testing"
func main() {
stat := func( f func() byte ) int {
allocs := testing.AllocsPerRun(10, func() {
f()
})
return int(allocs)
}
println(stat(createSlice)) // 0 (heap allocations)
}
There are more corner cases in which values with arbitrary sizes will be
allocated on stack. Similarly, future Go toolchain versions might (or might
not) change the implementation details so that the allocation behaviors for
these cases might change later.
//go:noinline
func boxLargeSizeValue() {
	var x interface{} = [N]byte{} // 1
	println(x != nil)
}

//go:noinline
func largeSizeParameter(x [N]byte) { // 2
}

//go:noinline
func largeSizeElement() {
	var m = map[int][N]byte{} // initialized, so the assignment below doesn't panic
	m[0] = [N]byte{} // 3
}
However, as mentioned in the chapter before the last, to avoid large value
copy costs, we should generally avoid copying large-size values this way.
That is why the three cases shown in the above code snippet are corner
cases. Generally, we should not write such code in practice.
// init-stack-size.go
package main

import "fmt"
import "time"

func main() {
	var c = make(chan time.Duration, 1)
	go foo(c)
	fmt.Println("foo:", <-c)
	go bar(c)
	fmt.Println("bar:", <-c)
}
Run it:
$ go run init-stack-size.go
foo: 42.051369ms
bar: 4.740523ms
From the outputs, we can see that the bar goroutine is more efficient than
the foo goroutine. The reason is simple: only one stack growth happens in
the lifetime of the bar goroutine, whereas more than 10 stack growths
happen in the lifetime of the foo goroutine.
During the scan+mark phase of a GC cycle, the garbage collector will scan
all pointers in the already-known alive value parts to find more alive value
parts (referenced by those already-known alive ones), until no more alive
value parts need to be scanned. In the process, all heap memory blocks
hosting alive value parts are marked as non-garbage memory blocks.
During the sweep phase of a GC cycle, the heap memory blocks which are
not marked as non-garbage will be viewed as garbage and collected.
So, generally speaking, the more pointers are used, the more pressure is
put on the GC (because there is more scan work to do).
GC pacer
At run time, a new GC cycle will start automatically when certain
conditions are reached. The current automatic GC pacer design includes the
following scheduling strategies:
(Note: The above descriptions are rough. The official standard Go runtime
also considers some other factors in the GC pacer implementation, which is
why the above descriptions use the "approximately" wording.)
The second strategy is an auxiliary strategy. This article will not talk more
about it. With the other two strategies in play, the more frequently memory
blocks are allocated on heap, the more frequently GC cycles start.
Please note that the current GC pacer design is not promised to be perfect
for every use case. It will be improved constantly in future official
standard Go compiler/runtime versions.
In fact, pointers could be often used to reduce copy costs. So whether or not
some pointers should be used depends on specific situations. We should
only avoid using pointers in the situations where their drawbacks are
obvious.
Memory fragments
With memory blocks being allocated and collected from time to time,
memory will become fragmented. The current official standard Go runtime
implementation manages memory in an effective way (based on tcmalloc),
so generally memory fragmenting is not a problem in Go programming.
several elements of a slice (or array or string) are still alive but others
become unused.
several fields of a struct value are still alive but others become unused.
a hybrid of the above two.
package main
import (
"log"
"runtime"
"time"
)
func main() {
log.SetFlags(0)
var p = &s[999]
runtime.GC()
// log.Println(*p) // 999
_ = p
time.Sleep(time.Second)
}
The output indicates that a new GC cycle (triggered by the manual
runtime.GC call) makes the memory block carrying the slice elements
collected; otherwise, the finalizers of these elements would not get executed.
Let's turn on the log.Println(*p) line and run the program again. Then it
will merely print 999 , which indicates the memory block carrying the slice
elements is still not collected when the manual GC cycle ends. Yes, the fact
that the last element of the slice is still in use prevents the whole
memory block from being collected.
We may copy the long-lived tiny value part (so that it will be carried on a
small memory block) to let the old larger memory block be collected. For
example, we may replace the following line in the above program (with the
log.Println(*p) line turned on):
var p = &s[999]
with a line that copies the element value into a new small memory block.
Then the manual GC will make the memory block carrying the slice
elements be collected.
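A self-contained variant of the idea, with the elided parts filled in by me (so the details here are assumptions, not the book's exact code), uses a finalizer to observe whether the large block gets collected:

```go
package main

import (
	"runtime"
	"sync/atomic"
	"time"
)

var collected int32

// keepLast copies the only needed element into a new small memory
// block, instead of returning a pointer into the large block.
func keepLast(s []int) *int {
	p := new(int)
	*p = s[len(s)-1]
	return p
}

func main() {
	arr := new([1000]int)
	for i := range arr {
		arr[i] = i
	}
	// The finalizer runs only after the whole block becomes unreachable.
	runtime.SetFinalizer(arr, func(*[1000]int) {
		atomic.StoreInt32(&collected, 1)
	})

	p := keepLast(arr[:]) // with p := &arr[999] instead, the block would stay alive
	arr = nil             // drop the reference to the large block

	runtime.GC()
	time.Sleep(100 * time.Millisecond) // give the finalizer goroutine a chance to run
	println(*p, atomic.LoadInt32(&collected) == 1)
}
```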
The return results of some functions in the standard strings (or bytes )
packages are substrings (or subslices) of some arguments passed to these
functions. Such functions include Fields , Split and Trim functions.
Similarly, we should duplicate such a result substring (or subslice) if it is
long lived but its length is small, and the corresponding argument is short
lived but has a large length; otherwise, the memory block carrying the
underlying bytes of (the result and the argument) will not get collected in
time.
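For example, with strings.Fields, a duplication might be done like this sketch (strings.Clone is available since Go 1.18; the function name here is mine):

```go
package main

import "strings"

// firstWord returns a copy of the first field of s. Without the
// strings.Clone call, the returned substring would share underlying
// bytes with s, keeping the whole (possibly huge) string alive.
func firstWord(s string) string {
	return strings.Clone(strings.Fields(s)[0])
}

func main() {
	println(firstWord("short head, followed by a huge short-lived tail ...")) // short
}
```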
There are also some other scenarios in which short-lived value parts will be
generated.
The memory blocks hosting roots are called root memory blocks. For the
official standard Go runtime, before version 1.18, roots have no impact on
the new heap memory percentage strategy; since version 1.18, they have.
If the new heap memory percentage strategy is enabled (the GOGC value is
non-negative), when the scan+mark phase of a GC cycle is just done, the
official standard Go runtime (version 1.18+) will calculate the target heap
size for the next garbage collection cycle, from the non-garbage heap
memory total size (called live heap here) and the root memory block total size
(called GC roots here), according to the following formula:

Target heap size = Live heap + (Live heap + GC roots) * GOGC / 100
When heap memory total size (approximately) exceeds the calculated target
heap size, the next GC cycle will start automatically.
Note: the minimum target heap size is (GOGC * 4 / 100)MB , which is also
the target heap size for the first GC cycle.
Note: in the following outputs of examples, to keep each output line
short, not all of these fields will be shown.
Let's use an artificial program as an example to show how to use this option.
// gctrace.go
package main
import (
"math/rand"
"time"
)
var x [512][]*int
func garbageProducer() {
rand.Seed(time.Now().UnixNano())
for i := 0; ; i++ {
n := 6 + rand.Intn(6)
for j := range x {
x[j] = make([]*int, 1<<n)
for k := range x[j] {
x[j][k] = new(int)
}
}
time.Sleep(time.Second / 1000)
}
}
func main() {
garbageProducer() // never exit
}
Run it with the gctrace=1 GODEBUG environment variable option
(several unrelated starting lines are omitted in the outputs).
Here, the #->#-># MB and # MB goal fields are what we are most
interested in. In a #->#-># MB field,
the last number is the non-garbage heap memory total size (a.k.a. live
heap).
the first number is the heap size at the start of a GC cycle, which should be
approximately equal to the target heap size (the number in the # MB
goal field of the same line).
The reason for these findings is that the live heap size is small (staying
roughly under 5MiB) and fluctuates much, while the root memory block total
size is almost zero.
One way to reduce the time spent on GC is to increase the GOGC value:
$ GOGC=1000 GODEBUG=gctrace=1 go run gctrace.go
(On Windows, the command should be set "GOGC=1000" & set
"GODEBUG=gctrace=1" & "go run gctrace.go" .)
From the above outputs, we can find that, after increasing the GOGC value to
1000, GC cycle intervals and the heap sizes at the start of GC cycles both
become much larger. But GC cycle intervals and live heap sizes still
fluctuate much, which might be a problem for some programs. The following
sections will introduce some ways to solve these problems.
// bigstacks.go
package main
import (
"math/rand"
"time"
)
var x [512][]*int
func garbageProducer() {
rand.Seed(time.Now().UnixNano())
for i := 0; ; i++ {
n := 6 + rand.Intn(6)
for j := range x {
x[j] = make([]*int, 1<<n)
for k := range x[j] {
x[j][k] = new(int)
}
}
time.Sleep(time.Second / 1000)
}
}
return s[v]
}
func main() {
go bigStack(nil, 123)
garbageProducer() // never exit
}
$ go version
go version go1.21.2 linux/amd64
The outputs show that GC cycle intervals become larger and fluctuate
much less (with Go toolchain version 1.18+).
In calculating the root memory block total size, the official standard runtime
version 1.18 uses stack memory segment sizes, whereas versions 1.19+ use
the real sizes of stacks (however, the numbers shown in the # MB stacks
elements are still stack memory segment sizes in 1.19 outputs). So, when
using Go toolchain 1.18 to run the above example program, the target heap
sizes will be somewhat larger:
$ go version
go version go1.18.5 linux/amd64
// bigglobals.go
package main
import (
"math/rand"
"time"
)
var x [512][]*int
func garbageProducer() {
rand.Seed(time.Now().UnixNano())
for i := 0; ; i++ {
n := 6 + rand.Intn(6)
for j := range x {
x[j] = make([]*int, 1<<n)
for k := range x[j] {
x[j][k] = new(int)
}
}
time.Sleep(time.Second / 1000)
}
}
func main() {
garbageProducer() // never exit
println(bigGlobal.p) // unreachable
}
Likewise, the outputs show that GC cycle intervals become larger and
fluctuate much less (with Go toolchain version 1.18+).
The examples in the current section show that root memory blocks
objectively act as memory ballasts.
The next section will introduce a memory ballast trick which also works with
Go toolchain versions earlier than 1.18.

package main

import "runtime"

func main() {
	// ballastSize is a value much larger than the
	// maximum possible live heap size of the program.
	ballast := make([]byte, ballastSize)
	programRun()
	runtime.KeepAlive(&ballast)
}
The trick allocates a slice with a large total element size. That size
contributes to the non-garbage heap size.
Let's modify the gctrace example shown above as:
// gcballast.go
package main
import (
"math/rand"
"runtime"
"time"
)
var x [512][]*int
func garbageProducer() {
rand.Seed(time.Now().UnixNano())
for i := 0; ; i++ {
n := 6 + rand.Intn(6)
for j := range x {
x[j] = make([]*int, 1<<n)
for k := range x[j] {
x[j][k] = new(int)
}
}
time.Sleep(time.Second / 1000)
}
}
func main() {
const ballastSize = 150 << 20 // 150 MiB
ballast := make([]byte, ballastSize)
garbageProducer()
runtime.KeepAlive(&ballast)
}
This program uses a 150MiB memory ballast, so that the non-garbage heap
size (live heap) of the program stays at about 150-160MiB. Consequently,
the target heap size of the program stays a bit over 300MiB (assuming the
GOGC value is 100 ). This makes GC cycle intervals become larger and
fluctuate much less.
Note: the elements of the local slice are never used, so the memory
block for the elements is only allocated virtually, not physically (at least on
Linux). This means the elements of the slice don't consume physical
memory, which is an advantage over using root memory blocks as memory
ballasts.
Use the memory limit strategy introduced in Go
toolchain 1.19 to avoid frequent GC cycles
The official standard Go toolchain 1.19 introduced a new scheduling strategy:
the memory limit strategy. The strategy may be configured either via the
GOMEMLIMIT environment variable or through the
runtime/debug.SetMemoryLimit function. This memory limit sets a
maximum on the total amount of memory that the Go runtime should use.
In other words, if the total amount of memory the Go runtime uses
(approximately) surpasses the limit, a new garbage collection cycle will
start. The limit is soft; a Go program will not exit when this limit is
exceeded. The default value of the memory limit is math.MaxInt64 , which
effectively disables this strategy.
The memory limit strategy and the new heap memory percentage strategy
may take effect together. For demonstration purposes, let's disable the new
heap memory percentage strategy and enable the memory limit strategy to
run the gctrace example program shown above again. Please make sure to
run the program with Go toolchain v1.19+; otherwise, the GOMEMLIMIT
environment variable will not be recognized, so that automatic garbage
collection will be completely turned off (as GOGC is set to off).
$ go version
go version go1.19 linux/amd64
$ GOMEMLIMIT=175MiB GOGC=off GODEBUG=gctrace=1 go run
gctrace.go
gc 1 @0.283s 1%: ..., 153->155->18 MB, 157 MB goal, ...
gc 2 @0.784s 1%: ..., 151->152->6 MB, 155 MB goal, ...
gc 3 @1.445s 1%: ..., 149->151->7 MB, 152 MB goal, ...
gc 4 @2.113s 1%: ..., 148->150->8 MB, 152 MB goal, ...
gc 5 @2.765s 1%: ..., 148->150->18 MB, 152 MB goal, ...
gc 6 @3.394s 1%: ..., 147->150->18 MB, 152 MB goal, ...
gc 7 @4.006s 1%: ..., 147->151->19 MB, 152 MB goal, ...
gc 8 @4.653s 1%: ..., 147->147->1 MB, 152 MB goal, ...
gc 9 @5.326s 1%: ..., 148->152->14 MB, 152 MB goal, ...
gc 10 @6.007s 1%: ..., 148->149->9 MB, 152 MB goal, ...
...
From the outputs, we can find that the memory limit strategy can also
make GC cycle intervals become much larger and fluctuate less.
One goal of the memory limit strategy is to provide an official way to replace
the memory ballast trick. For some Go programs, if the memory limit
values are set properly, the goal is achieved. However, sometimes it
might be difficult to choose a proper value for the limit. If the value is set
smaller than the amount of memory the Go program really needs, then GC
cycles will start frequently. This is a disadvantage compared to the memory
ballast solutions shown above. Please read the suggestions for using the
memory limit strategy (the containing article is an excellent one for
understanding garbage collection in Go).
Pointers
Avoid unnecessary nil array pointer checks in a
loop
There are some flaws in the current official standard Go compiler
implementation (v1.22.n). One of them is that some nil array pointer checks
are not moved out of loops. Here is an example that shows this flaw.
// unnecessary-checks.go
package pointers
import "testing"
const N = 1000
var a [N]int
//go:noinline
func g0(a *[N]int) {
for i := range a {
a[i] = i // line 12
}
}
//go:noinline
func g1(a *[N]int) {
_ = *a // line 18
for i := range a {
a[i] = i // line 20
}
}
func Benchmark_g0(b *testing.B) {
for i := 0; i < b.N; i++ { g0(&a) }
}
Let's run the benchmarks with the -S compiler option; we get the following
outputs (uninteresting text is omitted):
//go:noinline
func g2(x *[N]int) {
a := x[:]
for i := range a {
a[i] = i
}
}
Please note that the flaw might be fixed in future compiler versions.
And please note that, if the three implementation functions are inline-able,
the benchmark results will change much. That is the reason why the
//go:noinline compiler directives are used here. (Before Go toolchain
v1.18, the //go:noinline compiler directives were actually unnecessary
here, because Go toolchains before v1.18 never inline a function containing
a for-range loop.)
For the cases in which an array pointer is a struct field, things are a little
more complex. The _ = *t.a line in the following code does not help avoid
the compiler flaw. For example, in the following code, the performance
difference between the f1 function and the f0 function is small. (In fact,
the f1 function might be even slower, if a NOP instruction is generated
within its loop.)
type T struct {
a *[N]int
}
//go:noinline
func f0(t *T) {
for i := range t.a {
t.a[i] = i
}
}
//go:noinline
func f1(t *T) {
_ = *t.a
for i := range t.a {
t.a[i] = i
}
}
To move the nil array pointer checks out of the loop, we should copy the
t.a field to a local variable, then adopt the trick introduced above:
//go:noinline
func f3(t *T) {
a := t.a
_ = *a
for i := range a {
a[i] = i
}
}
//go:noinline
func f4(t *T) {
a := t.a[:]
for i := range a {
a[i] = i
}
}
//go:noinline
func f2(t *T) {
a := t.a
for i := range a {
a[i] = i
}
}
The f2 implementation is not as fast as the f3 and f4 implementations, but
it is faster than the f0 and f1 implementations. However, that is another
story.
If the elements of an array pointer field are only read (not modified) in the
loop, then the f1 way is as performant as the f3 and f4 ways.
Personally, for most cases, I think we should try to use the slice way (the f4
way) to get the best performance, because generally slices are optimized
better than arrays by the official standard Go compiler.
// avoid-indirects_test.go
package pointers
import "testing"
//go:noinline
func f(sum *int, s []int) {
for _, v := range s { // line 8
*sum += v // line 9
}
}
//go:noinline
func g(sum *int, s []int) {
var n = *sum
for _, v := range s { // line 16
n += v // line 17
}
*sum = n
}
This is not a compiler flaw. In fact, the f and g functions are not equivalent
(though for most use cases in practice, their results are the same). For
example, if they are called like the following code shows, then they return
different results (thanks to skeeto@reddit for making this correction).
{
	var s = []int{1, 1, 1}
	var sum = &s[2]
	f(sum, s)
	println(*sum) // 6
}
{
	var s = []int{1, 1, 1}
	var sum = &s[2]
	g(sum, s)
	println(*sum) // 4
}
//go:noinline
func h(s []int) int {
var n = 0
for _, v := range s {
n += v
}
return n
}
For example, in the following code, the function g is much more performant
than the function f .
package structs
import "testing"
const N = 1000
type T struct {
x int
}
//go:noinline
func f(t *T) {
t.x = 0
for i := 0; i < N; i++ {
t.x += i
}
}
//go:noinline
func g(t *T) {
var x = 0
for i := 0; i < N; i++ {
x += i
}
t.x = x
}
var t = &T{}
The function g uses a local variable x to store the sum value and assigns
the sum value to the struct field in the end. The official standard Go
compiler is smart enough to only generate register-processing assembly
instructions for the loop of the function g .
//go:noinline
func h(t *T) {
x := &t.x
for i := 0; i < N; i++ {
*x += i
}
}
package arrays
import "testing"
type T [1000]byte
var x T
var r bool
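The two compared functions are elided above. They might look like the following sketch (my own reconstruction; the zero variable is my addition, standing in for the book's global variable operand):

```go
package main

type T [1000]byte

var x T
var zero T // a global zero-value variable to compare against
var r bool

//go:noinline
func CompareWithGlobalVar() {
	r = x == zero // compared against a global variable
}

//go:noinline
func CompareWithLiteral() {
	r = x == T{} // compared against a composite literal
}

func main() {
	CompareWithGlobalVar()
	println(r) // true
	CompareWithLiteral()
	println(r) // true
}
```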
By using the -S compiler option, we can find that the compiler generates
fewer instructions for the function CompareWithGlobalVar than for the
function CompareWithLiteral . That is why the function
CompareWithGlobalVar is more performant.
For small-size arrays, the performance difference between the two functions
is small.
Please note that future compiler versions might be improved to remove the
performance difference between the two functions.
y = make([]T, n)
copy(y, x) // assume the return value is m
y = make([]T, len(x))
copy(y, x)
In other words, the optimization only works for the first case in the
following code:
var s = make([]T, 10000)
// case 1:
y = make([]T, len(s)) // works
copy(y, s)
// case 2:
y = make([]T, len(s)) // not work
_ = copy(y, s)
// case 3:
y = make([]T, len(s)) // not work
f(copy(y, s))
// case 4:
y = make([]T, len(s), len(s)) // not work
copy(y, s)
// case 5:
var a = [1][]T{s}
y = make([]T, len(a[0])) // not work
copy(y, a[0])
// case 6:
type SS struct {x []T}
var ss = SS{x: s}
y = make([]T, len(ss.x)) // not work
copy(y, ss.x)
The capacity of the result of a make call is exactly the argument passed to
the make call. For example, cap(make([]T, n)) == n and
cap(make([]T, n, m)) == m . This means some bytes might be wasted in
the memory block hosting the elements of the result.
If an append call needs to allocate, then the capacity of the result slice of
the append call is unspecified. The capacity is often larger than the length
of the result slice. Assume the result of the append call is assigned to a
slice s ; then the elements within s[len(s):cap(s)] will get zeroed in the
append call. The other elements will be overwritten by the elements of the
argument slices. For example, in the following code, the elements within
s[len(x)+len(y):] will get zeroed in the append call.
s = append(x, y...)
The v1.17 algorithm has a drawback: the capacity of the result slice doesn't
increase monotonically with the length of the first parameter of the append
function. An example:
package main

func main() {
	x1 := make([]int, 897)
	x2 := make([]int, 1024)
	y := make([]int, 100)
	println(cap(append(x1, y...)))
	println(cap(append(x2, y...)))
}
2048
1280
The new algorithm in Go 1.18+ often allocates less memory than the old
one in Go 1.17.
Please note, each slice growth needs one memory allocation. So we should
try to grow slices fewer times in programming.
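For example, reserving the final capacity up front makes the slice grow (allocate) only once, instead of repeatedly during the appends:

```go
package main

func main() {
	x := []int{1, 2, 3, 4}

	// Reserve the final capacity up front, so the backing
	// array is allocated exactly once:
	squares := make([]int, 0, len(x))
	for _, v := range x {
		squares = append(squares, v*v) // never reallocates
	}
	println(len(squares), cap(squares)) // 4 4
}
```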
Another subtle difference (up to Go toolchain 1.22) between the copy and
append functions is that the copy function will not copy elements when it
detects that the addresses of the first elements of its two slice parameters are
identical, yet the append function never performs such detections. This
means, in the following code, the copy call is much more efficient than the
append call.
An example:
package main

func main() {
	x := make([]byte, 100, 500)
	y := make([]byte, 500)
	a := append(x, y...)
	b := append(x[:len(x):len(x)], y...)
	println(cap(a)) // 1024
	println(cap(b)) // 640
}
The outputs shown as comments are for Go 1.17. For Go 1.18, instead, the
above program prints:
896
640
Surely, if we confidently know that the free capacity of the first argument
slice of an append call is enough to hold all appended elements, then we
should not clip the first argument.
// way 2
func Grow_Oneline(x []T, c int) []T {
return append(x, make([]T, c - len(x))...)[:len(x)]
}
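Way 1 is elided above. It is typically written like the following sketch (my own reconstruction, using int as a stand-in for the element type T):

```go
package main

// way 1 (a sketch of the elided code)
func Grow_TwoLines(x []int, c int) []int {
	r := make([]int, c) // thanks to the optimization, r[:len(x)] is not actually zeroed
	copy(r, x)
	return r[:len(x)]
}

func main() {
	s := Grow_TwoLines([]int{1, 2}, 100)
	println(len(s), cap(s)) // 2 100
}
```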
Both of the two ways are specially optimized by the official standard Go
compiler. As mentioned above, the make call in way 1 doesn't reset
elements within r[:len(x)] . In way 2, the make call doesn't make
allocations at all.
In theory, with the two optimizations, the two ways have comparable
performance. But benchmark results often show way 1 is a little more
performant.
type S []T
Clone slices
Since Go toolchain version 1.15, the most efficient way to clone a slice is
the make+copy way:
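That way (shown earlier in this chapter) looks like:

```go
package main

func main() {
	x := []int{1, 2, 3}

	// the make+copy way to clone a slice:
	y := make([]int, len(x)) // exactly len(x) elements; none wasted
	copy(y, x)

	println(len(y), cap(y)) // 3 3
}
```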
For many cases, the make+copy way is a little faster than the following
append way, because as mentioned above, an append call might allocate
and zero some extra elements.
For example, in the following code, 8191 extra elements are allocated and
zeroed.
x := make([]byte, 1<<15+1)
y := append([]byte(nil), x...)
println(cap(y) - len(x)) // 8191
If the element orders of the merged slice are important, we could use the
following two ways to merge the slice x and y (assume the length of y is
not zero).
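The two ways are elided above. They are typically written like the following sketch (using int as a stand-in element type):

```go
package main

func main() {
	x := []int{1, 2}
	y := []int{3, 4, 5}

	// the append way (x is clipped so the result never shares elements with x):
	merged1 := append(x[:len(x):len(x)], y...)

	// the make+copy way:
	merged2 := make([]int, len(x)+len(y))
	copy(merged2, x)
	copy(merged2[len(x):], y)

	println(len(merged1), len(merged2)) // 5 5
}
```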
The append way is clean but it is often a little slower, because the append
function often allocates and zeroes some extra elements. But if the length of
y is much larger than the length of x , then the append way is probably
faster, because the elements within merged[len(x):] are (unnecessarily)
zeroed in the make+copy way (then overwritten by the elements of y ). So
which way is more performant depends on the specific situation.
If the element orders of the merged slice are not important and the append
way is chosen, then try to pass the shorter slice as the first argument, so that
some memory might be saved. An example to show the fact:
package main

func main() {
	x := make([]int, 98)
	y := make([]int, 666)
	a := append(x, y...)
	b := append(y, x...)
	println(cap(a)) // 768
	println(cap(b)) // 1360
}
The outputs shown as comments are for Go 1.17. For Go 1.18, instead, the
above program prints:
768
1024
If the free element slots in slice x are enough to hold all elements of slice y
and it is allowed to let the result slice and x share elements, then
append(x, y...) is the most performant way, for it doesn't allocate.
Since Go 1.22, the Concat function in the slices standard package exactly
does the job.
If the free capacity of the base slice is large enough to hold all the inserted
elements, and it is allowed to let the result slice and the base slice share
elements, then the following way is the most efficient, for this way doesn't
allocate.
s = s[:len(s)+len(vs)]
copy(s[i+len(vs):], s[i:])
copy(s[i:], vs)
Note: the above implementations don't consider the cases in which slice
elements might overlap. If such cases can't be ignored, please use the
Insert function in the slices standard package.
ranging over an array with two iteration variables if the size of the
array is large, because the array is copied. (Note, if only one iteration
variable is used, then the copy will not be made.)
ranging over a slice or array with two iteration variables if the element
size is large, because each element will be copied to the second
iteration variable once in the iteration process.
is equivalent to
{
	aCopy := anArray // all elements are copied
	var v ElementType // element type of anArray
	for i := range aCopy {
		v = aCopy[i] // each element is copied again
		...
	}
}
package arrays
import "testing"
//go:noinline
func sum_forrange1(s []int) int {
var n = 0
for i := range s {
n += s[i]
}
return n
}
//go:noinline
func sum_forrange2(s []int) int {
var n = 0
for _, v := range s {
n += v
}
return n
}
//go:noinline
func sum_plainfor(s []int) int {
var n = 0
for i := 0; i < len(s); i++ {
n += s[i]
}
return n
}
This optimization also works if two iteration variables are present but the
second one is the blank identifier _ .
For most cases, the above code is more performant than the following code:
On my machine, the memclr way is slower only if the length of the array or
slice is smaller than 6 (element type is byte ).
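The memclr-optimized loop form referred to here is a single-iteration-variable range loop that assigns the zero value to every element, for example:

```go
package main

func main() {
	s := []byte{1, 2, 3}

	// This loop form is recognized by the compiler and compiled into
	// a single runtime memclr call instead of a per-element loop:
	for i := range s {
		s[i] = 0
	}

	println(s[0], s[1], s[2]) // 0 0 0
}
```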
Before Go toolchain v1.19, the ranged container must be an array or slice to
make this optimization work. Since Go toolchain v1.19, it may be also a
pointer to an array.
In fact, this optimization is more meaningful for slices than for arrays and
array pointers, as there is a simpler (and sometimes more performant)
way to reset array elements:
anArray = ArrayType{}
*anArrayPointer = ArrayType{}
Note: Go 1.21 added a new built-in function, clear , which may be used
to reset all the elements in a slice. So since Go 1.21, we should try to use
the clear function instead of relying on the memclr optimization to
reset slice elements.
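For example:

```go
package main

func main() {
	s := []int{1, 2, 3}
	clear(s) // Go 1.21+: resets every element to the zero value
	println(s[0], s[1], s[2]) // 0 0 0
}
```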
// subslice_test.go
package arrays
import "testing"
const N = 1 << 10
var s = make([]byte, N)
var r = make([]byte, N/4)
To keep the guarantee that the bytes stored in a string will never be
modified, when converting a string to a byte slice, the string and the result
byte slice should not share byte elements. This means that the string content
(the bytes) will be duplicated and stored into the result byte slice (one
memory allocation is needed). Similarly, for the same reason, when
converting a byte slice to a string, the byte slice content (the byte elements
of the slice) will be duplicated and stored into the result string too (one
memory allocation is needed).
In fact, under some situations, the duplications are not necessary. The
current standard Go compiler makes several special optimizations to avoid
duplications for some simple cases. Such optimizations will be listed below.
If we use the official standard Go compiler to build the following code, the
performance difference between the function f and the function g is
negligible. The reason is that the []byte(s) conversion used in the function
f doesn't make allocations.
// string-2-bytes.go
package main
import t "testing"
const x = "abcdefghijklmnopqrstuvwxyz0123456789"
var y = "abcdefghijklmnopqrstuvwxyz0123456789"
func rangeNonConstant() {
for range []byte(y) {}
}
func convertConstant() {
_ = []byte(x)
}
func convertNonConstant() {
_ = []byte(y)
}
func main() {
stat := func(f func()) int {
allocs := t.AllocsPerRun(10, f)
return int(allocs)
}
println(
stat(rangeNonConstant),
stat(convertConstant),
stat(convertNonConstant),
)
}
Sadly, the latest compiler (v1.22.n) is still not smart enough to remove the
byte duplication in the conversions shown in the following code:
package main
import "bytes"
import t "testing"
var y = "abcdefghijklmnopqrstuvwxyz0123456789"
var s = []byte(y)
func compareNonConstants() {
_ = bytes.Compare([]byte(y), []byte(y))
}
func concatStringAndBytes() {
_ = append([]byte(y), s...)
}
func main() {
stat := func(f func()) int {
allocs := t.AllocsPerRun(10, f)
return int(allocs)
}
println(stat(compareNonConstants)) // 2
println(stat(concatStringAndBytes)) // 2
}
This optimization makes the verbose function more efficient than the
clean function shown in the following code (as of the official standard Go
compiler v1.22.n):
package main
import t "testing"
func main() {
x := []byte{1023: 'x'}
y := []byte{1023: 'y'}
z := []byte{1023: 'z'}
stat := func(f func(x, y, z []byte)) int {
allocs := t.AllocsPerRun(10, func() {
f(x, y, z)
})
return int(allocs)
}
println(stat(verbose)) // 0
println(stat(clean)) // 3
}
From the outputs, we can see that the verbose function doesn't make
allocations but the clean function makes three, which is just the
reason why the former is more performant.
We could also use the bytes.Compare function to compare two byte slices.
The bytes.Compare function way is often more performant for the cases in
which three-way comparisons (like the following code shows) are needed.
package main
import t "testing"
var m = map[string]int{}
var key = []byte{'k', 'e', 'y'}
var n int
func get() {
n = m[string(key)]
}
func inc() {
m[string(key)]++
}
func set() {
m[string(key)] = 123
}
func main() {
stat := func(f func()) int {
allocs := t.AllocsPerRun(10, f)
return int(allocs)
}
println(stat(get)) // 0
println(stat(set)) // 1
println(stat(inc)) // 1
}
package main
import t "testing"
type T struct {
a int
b bool
k [2]string
}
var m = map[T]int{}
var key = []byte{'k', 'e', 'y', 99: 'z'}
var n int
func get() {
n = m[T{k: [2]string{1: string(key)}}]
}
func main() {
print(int(t.AllocsPerRun(10, get))) // 0
}
package main
import t "testing"
var key = []byte{'k', 'e', 'y'}
var m1 = map[string]int{"key": 0}
func modify1() {
	m1[string(key)]++
	// (logically) equivalent to:
	// m1[string(key)] = m1[string(key)] + 1
}
// The declarations of m2 and modify2 are elided in this excerpt.
// Definitions consistent with the surrounding text (a map with a
// pointer element type) might be:
var m2 = map[string]*int{"key": new(int)}
func modify2() {
	*m2[string(key)]++
}
func main() {
	stat := func(f func()) int {
		allocs := t.AllocsPerRun(10, f)
		return int(allocs)
	}
	println(stat(modify1)) // 1
	println(stat(modify2)) // 0
}
So if the entries of a map are seldom deleted but the elements of the map
are modified frequently, it is best to use a pointer type as the map element
type.
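A sketch of the pointer-element-type approach (the `incr` helper is an illustrative name, not from the book):

```go
package main

// With *int as the element type, frequent counter updates go through
// the pointer and need no repeated map assignments, which avoids the
// allocation made by m[string(key)]++ on a map[string]int.
var m = map[string]*int{}

func incr(key string) {
	p, ok := m[key]
	if !ok {
		p = new(int) // allocate once per key, not once per update
		m[key] = p
	}
	*p++
}

func main() {
	incr("k")
	incr("k")
	println(*m["k"]) // 2
}
```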
In the following example, the function f (the slightly more verbose
one) is much more efficient than the function g for most cases (as of
Go toolchain v1.22.n).
package main
import "testing"
var x string
func stat(add func() string) int {
c := func() {
x = add()
}
allocs := testing.AllocsPerRun(10, c)
return int(allocs)
}
func main() {
println(stat(f)) // 1
println(stat(g)) // 3
}
Please note that, currently (Go toolchain 1.22 versions), this optimization is
only useful for byte slices with lengths larger than 32. If we change the
length of the byte slice s to 32 (by declaring it with var s = []byte{31:
'x'} ), then the performance difference between the functions f and g
becomes negligible. Please read the next section for the reason.
The slightly verbose way has a drawback: it wastes at least one more
byte of memory. If, at coding time, we know the byte value at a specified
index of one operand, then this drawback can be avoided. For example,
assuming we know the first byte of the first operand is always $ , we
could modify the slightly verbose way as the following code shows, to
avoid wasting the extra memory.
In fact, recall that the example in the last section uses a byte slice with 33
bytes; the reason is to prevent the byte elements of the string
concatenation operands from being allocated on the stack.
In the following program, the function g needs 3 heap allocations, but the
function f needs none. The only differences between the two functions are
the lengths of the involved byte slice and strings. The function f actually
makes 3 stack allocations, but the function testing.AllocsPerRun only
counts heap allocations.
package main
import "testing"
// The declarations of str and stat are elided in this excerpt. The
// printed lengths imply str contains 16 bytes; for example:
var str = "0123456789abcdef"
func stat(f func()) int {
	return int(testing.AllocsPerRun(10, f))
}
func f() {
x := str + str // does not escape
y := []byte(x) // does not escape
println(len(y), cap(y)) // 32 32
z := string(y) // does not escape
println(len(x), len(z)) // 32 32
}
func g() {
x := str + str + "x" // does not escape
y := []byte(x) // does not escape
println(len(y), cap(y)) // 33 48
z := string(y) // does not escape
println(len(x), len(z)) // 33 33
}
func main() {
println(stat(f)) // 0
println(stat(g)) // 3
}
package bytes
import "testing"
For the above two reasons, in theory, the + operator way should always
be more performant than the strings.Builder way. However, due to a
runtime implementation imperfection (in Go 1.15/1.16/1.17), sometimes the +
operator way might be a little slower than the strings.Builder way. This
imperfection has been fixed since Go 1.18.
import "testing"
import "strings"
var s1 = strings.Repeat("a", M)
var s2 = strings.Repeat("a", N)
var s3 = strings.Repeat("a", K)
var r1, r2 string
func init() {
println("======", M, N, K)
}
//go:noinline
func Concat_WithPlus(a, b, c string) string {
return a + b + c
}
//go:noinline
func Concat_WithBuilder(ss ...string) string {
var b strings.Builder
var n = 0
for _, s := range ss {
n += len(s)
}
b.Grow(n)
for _, s := range ss {
b.WriteString(s)
}
return b.String()
}
====== 12 16 32
Benchmark_Concat_WithPlus-4 192.2 ns/op
Benchmark_Concat_WithBuilder-4 196.8 ns/op
There is actually a third way to concatenate strings: build a byte slice, then
copy the bytes from the concatenated strings to the byte slice, finally
convert the byte slice to the result string.
The byte slice way is almost the same as the implementation of the
strings.Builder way, except that the final conversion (from byte slice to
string) needs a memory allocation, whereas the strings.Builder way uses
unsafe mechanism to avoid the allocation. So in theory, the byte slice way
should always be slower than the strings.Builder way. However, if
the byte slice used in the byte slice way is created with a constant capacity,
then its elements might be allocated on stack, which is an advantage over
the strings.Builder way. The advantage might make the byte slice way
faster than the above introduced two ways for some use cases.
package bytes
import "testing"
// The declarations of s and r are elided in this excerpt; for example:
var s = "0123456789abcdef"
var r string
//go:noinline
func Concat_WithPlus(a, b, c, d string) string {
return a + b + c + d
}
//go:noinline
func Concat_WithBytes(ss ...string) string {
var n = 0
for _, s := range ss {
n += len(s)
}
var bs []byte
if n > 64 {
bs = make([]byte, 0, n) // escapes to heap
} else {
bs = make([]byte, 0, 64) // does not escape
}
for _, s := range ss {
bs = append(bs, s...)
}
return string(bs)
}
func Benchmark_Concat_WithPlus(b *testing.B) {
for i := 0; i < b.N; i++ {
r = Concat_WithPlus(s, s, s, s)
}
}
We could also modify the byte elements of the slice as needed before
converting the slice to the final string. This is another benefit compared
to the other two ways.
Generally, if the length of the string is much larger than that of the byte
slice, then the verbose way is more performant. On the contrary, if the
length of the byte slice is much larger than that of the string, then the
one-line way is more performant.
package strings
package bytes
strings.Compare(x, y) == 0
strings.Compare(x, y) != 0
x == y
x != y
In the following example, the function f is more efficient than the function
g . The reason is the former makes one allocation, whereas the latter makes
two.
In the following code, using array values as the map keys is more
performant than using strings as the map keys, because the former way
doesn't allocate in building entry keys.
package bytes
import "testing"
var ma = make(map[[2]string]struct{})
var ms = make(map[string]struct{})
We could also use struct values as the map keys, which should be as
performant as using array keys.
The third example shows the performance difference between two ways
of comparing strings case-insensitively.
package bytes
import "testing"
import "strings"
var ss = []string {
"AbcDefghijklmnOpQrStUvwxYz1234567890",
"abcDefghijklmnopQRSTuvwXYZ1234567890",
"aBcDefgHIjklMNOPQRSTuvwxyz1234567890",
}
The io.Writer type in the standard library only has one method:
Write([]byte) (int, error) . When a string needs to be written, it must
be converted to a byte slice before being passed to the Write method. This
is quite inefficient. In the following code, a BytesWriter type is
implemented to support writing strings without converting the strings to
byte slices.
package bytes
import "testing"
import "io"
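The BytesWriter implementation is elided in this excerpt; a minimal sketch consistent with the description (the field and method names are assumptions) might look like:

```go
package main

import "io"

// BytesWriter collects written bytes in an in-memory buffer and,
// unlike a plain io.Writer, also supports writing strings directly,
// so callers don't need a string-to-[]byte conversion (and its
// allocation) per write.
type BytesWriter struct {
	buf []byte
}

func (w *BytesWriter) Write(p []byte) (int, error) {
	w.buf = append(w.buf, p...)
	return len(p), nil
}

// WriteString appends a string without converting it to a byte slice.
func (w *BytesWriter) WriteString(s string) (int, error) {
	w.buf = append(w.buf, s...)
	return len(s), nil
}

var _ io.Writer = (*BytesWriter)(nil) // it still satisfies io.Writer

func main() {
	var w BytesWriter
	w.WriteString("Go")
	w.Write([]byte("!"))
	println(string(w.buf)) // Go!
}
```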
From the benchmark results, we could find that the BytesWriter way is
much more performant than the general io.Writer way, because the
former way doesn't allocate (except the single buffer allocation).
Please note, there is a type in the standard package, bufio.Writer , which
acts like the BytesWriter type. Generally, we should use that type instead.
BCE (Bounds Check Elimination)
Go is a memory safe language. In array/slice/string element indexing and
subslice operations, the Go runtime checks whether or not the involved
indexes are out of range. If an index is out of range, a panic is
produced to prevent the invalid index from doing harm. This is called
bounds checking.
Bounds checks make our code run safely; on the other hand, they also make
our code run a little slower. This is a trade-off a safe language must make.
The following will list some examples to show in which cases BCE works
and in which cases BCE doesn't work.
Example 1
A simple example:
// example1.go
package main
func f1a(s []struct{}, index int) {
_ = s[index] // line 5: Found IsInBounds
_ = s[index]
_ = s[index:]
_ = s[:index+1]
}
func main() {}
The outputs show that only two code lines need bounds checks in the above
example code.
Note that:
* Go toolchains with versions earlier than 1.19 fail to remove the
unnecessary bounds check in the f1e function.
* Go toolchains with versions earlier than 1.21 fail to remove the
unnecessary bounds check in the f1g function.
And note that, up to now (Go toolchain v1.22.n), the official standard
compiler doesn't perform BCE for an operation in a generic function if the
operation involves type parameters and the generic function is never
instantiated. For example, the command
go run -gcflags=-d=ssa/check_bce bar.go will report nothing.
// bar.go
package bar
// var _ = foo[bool]
However, if the variable declaration line is enabled, then the compiler will
report:
Example 2
All the bound checks in the slice element indexing and subslice operations
shown in the following example are eliminated.
// example2.go
package main
func main() {}
Run it, and we will find that nothing is printed. Yes, the official standard Go
compiler is so clever that it finds that all bounds checks may be removed in
the above example code.
There are still some small imperfections. If we modify the f2g and f2h
functions as that shown in the following code, then the compiler (v1.22.n)
fails to remove the bound checks for the two subslice operations.
// example2b.go
package main
func main() {}
We may give the compiler some hints, by uncommenting the comment lines,
to remove these bounds checks.
Example 3
We should try to evaluate the element indexing or subslice operation with
the largest index as early as possible, to reduce the number of bounds
checks.
// example3.go
package main
func main() {
}
From the output, we could learn that there are 4 bound checks in the f3a
function, but only one in the f3b function.
Example 4
Since Go toolchain v1.19, the bould check in the f5a function is
successfully removed,
In the following example, by adding one hint line in the function f4b , all
bound checks in the loop are eliminated.
Please note that, as of Go toolchain v1.22.n, the two hints used in the f4c
and f4d functions do not work (though they should).
type T = string
// a successful hint
if len(x) > len(y) {
panic("unreachable")
}
The above hint works when T is either a string type or a slice type, whereas
each of the following two hints only works for one case (as of Go toolchain
v1.22.n).
Please note that future versions of the official standard Go compiler
will likely become smarter, so that the above hints may become unnecessary.
However, please note that, there might be some other factors which will
affect program performances. On my machine (Intel i5-4210U CPU @
1.70GHz, Linux/amd64), among the above 3 functions, the function f7b is
actually the least performant one.
* the f8z function is the most performant one (in line with expectations),
* but the f8y function is as performant as the f8x function
(unexpected).
var s = make([]int, 5)
func fa0() {
for i := range s {
s[i] = i // Found IsInBounds
}
}
func fa1() {
s := s
for i := range s {
s[i] = i
}
}
Arrays are often more BCE-friendly than slices. In the following code, the
array version functions ( fb2 and fc2 ) don't need bound checks.
Please note that future versions of the official standard Go compiler
will likely become smarter, so that some BCE-unfriendly code might become
BCE-friendly later.
aMap = nil
// or
aMap = make(map[K]V)
If the backing array of the map is not referenced elsewhere, then the
backing array will be collected eventually after being released.
If there will be many new entries to be put in the map after it is cleared,
then the former way is preferred; otherwise, the latter (release) ways are
preferred.
Note: since Go 1.21, there is a better way to do this job. Go 1.21 introduced
a new built-in function, clear , which may be used to clear all entries in a
map, including those ones with NaN keys.
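A minimal illustration of the clear built-in (requires Go 1.21+):

```go
package main

func main() {
	m := map[string]int{"a": 1, "b": 2}
	clear(m) // removes all entries (including any with NaN keys),
	// while keeping the allocated internal buckets for reuse
	println(len(m)) // 0
}
```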
package maps
import "testing"
var m = map[int]int{}
Pointers in maps
If the key type and element type of a map both don't contain pointers, then
in the scan phase of a GC cycle, the garbage collector will not scan the
entries of the map. This could save much time.
This tip is also valid for other kinds of container in Go, such as slices,
arrays and channels.
If we can make sure that the string values used in the entries of a map have
a max length and the max length is small, then we could use the array type
[N]byte to replace the string types (where N is the max string length).
Doing this will save much garbage collection scanning time if the number
of the entries in the map is very large.
For example, in the following code, the entries of mapB contain no pointers,
but the (string) keys of mapA contain pointers. So the garbage collector
will skip mapB during the scan phase of a GC cycle.
And please note that, the official standard compiler makes special
optimizations on hashing map keys whose sizes are 4 or 8 bytes. So, from
the point of view of saving CPU, it is better to use map[[8]byte]V instead
of map[[5]byte]V , and it is better to use map[int32]V instead of
map[int16]V .
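A sketch of the string-to-array key replacement (the `toKey` helper is an illustrative name; it assumes all keys are at most 8 bytes long):

```go
package main

// toKey packs a short string (assumed <= 8 bytes) into an [8]byte
// array. A map keyed by [8]byte contains no pointers, so the GC
// skips scanning its entries; and 8-byte keys get the compiler's
// special fast hashing path.
func toKey(s string) (k [8]byte) {
	copy(k[:], s) // longer strings would be silently truncated
	return
}

func main() {
	m := map[[8]byte]int{}
	m[toKey("hello")] = 1
	println(m[toKey("hello")]) // 1
}
```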
package maps
import "testing"
We could use an extra counter table (a slice) and let the map record indexes
to the table, to avoid making many allocations and creating many pointers,
as the following code shows:
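The example code is elided in this excerpt; a minimal sketch of the counter-table technique (the `counters`/`index`/`increase` names are assumptions) might look like:

```go
package main

// Instead of map[string]*int (one allocation and one pointer per
// key), the map stores indexes into a shared counter slice, so the
// map entries themselves contain no element pointers.
var counters = make([]int, 0, 128) // the extra counter table
var index = make(map[string]int)   // key -> index into counters

func increase(key string) {
	i, ok := index[key]
	if !ok {
		i = len(counters)
		counters = append(counters, 0)
		index[key] = i
	}
	counters[i]++
}

func main() {
	increase("foo")
	increase("foo")
	println(counters[index["foo"]]) // 2
}
```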
Please note that the above benchmark results show that the latter two ways
both make zero allocations. This is actually not true. It is just that each of
the latter two benchmark runs makes fewer than one allocation on average,
which is truncated to zero. This is a deliberate design of the benchmark
reports in the standard packages.
// Package-level maps.
var boolToInt = map[bool]int{true: 1, false: 0}
var boolToFunc = map[bool]func(){true: f, false: g}
If there are many such identical if-else blocks used in code, using maps
with bool keys will reduce many boilerplates and make code look much
cleaner. For most use cases, this is generally good. However, as of Go
toolchain v1.22.n, the map way is not very efficient from the code
execution performance view. The following benchmarks show the
performance differences.
package maps
import "testing"
//go:noinline
func f() {}
//go:noinline
func g() {}
From the benchmark results, we could get that the if-else block way is
much more performant than the map-switch way.
For the use cases which require high code performance, we can simulate a
bool-key map by using an index table to reduce if-else boilerplates, but
still keep the simplicity of the map switch way, with the help of a bool-to-
int function. The following benchmarks show how to use the index table
way.
From the above code, we could find that the uses of the index table way are
almost as clean as the map-switch way, though an extra tiny b2i function is
needed. And from the following benchmark results, we know that the index
table way is as performant as the if-else block way.
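The index-table example is elided in this excerpt; a minimal sketch of the technique (the `call` wrapper is an illustrative name) might look like:

```go
package main

// b2i converts a bool to an int (false -> 0, true -> 1).
func b2i(b bool) int {
	if b {
		return 1
	}
	return 0
}

func f() { println("f") }
func g() { println("g") }

// The index table: element 0 for the false case, element 1 for the
// true case - mirroring map[bool]func(){true: f, false: g}, but
// backed by a plain array instead of a map.
var boolToFunc = [2]func(){g, f}

func call(cond bool) {
	boolToFunc[b2i(cond)]() // almost as clean as a bool-key map lookup
}

func main() {
	call(true)  // calls f
	call(false) // calls g
}
```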
package channels
import (
"sync"
"sync/atomic"
"testing"
)
var g int32
var m sync.Mutex
func Benchmark_Mutex(b *testing.B) {
for i := 0; i < b.N; i++ {
m.Lock()
g++
m.Unlock()
}
}
From the results, we can see that using channels to concurrently increase
a value is much slower than the other synchronization ways. The atomic
way is the best.
If it is possible, we should try not to share a value between multiple
goroutines, so that we don't need to do any synchronization at all for the value.
package channels
import "testing"
So we should try to limit the number of case branches within a select code
block.
The official standard Go compiler treats a select code block with only one
case branch (and without default branch) as a simple general channel
operation.
For some cases, we could merge multiple channels as one, to avoid the
performance loss on executing multi-case select code blocks. We could
use an interface type or a struct type as the channel element type to achieve
this goal. If the channel element type is interface, then we can use a type
switch to distinguish message kinds. If the channel element type is struct,
then we can check which field is set to distinguish message kinds. The
following benchmark code shows the performance differences between
these ways.
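The interface-element variant can be sketched as follows (a minimal illustration, not the book's benchmark code):

```go
package main

import "fmt"

// Two logical channels merged into one, using an interface element
// type; a type switch on the receiver side distinguishes the
// message kinds, so a single-case receive replaces a two-case select.
func main() {
	merged := make(chan interface{}, 2)
	merged <- 123     // an int message
	merged <- "hello" // a string message
	close(merged)
	for m := range merged {
		switch v := m.(type) {
		case int:
			fmt.Println("int message:", v)
		case string:
			fmt.Println("string message:", v)
		}
	}
}
```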
package channels
import "testing"
var vx int
var vy string
From the results, we could get that two-case select code blocks are slower
than one-case select code blocks. In other words, for some cases, merging
several channels into one is a good way to improve program performance.
package channels
import "testing"
// inline.go
package inline
From the output, we know that the compiler thinks the bar function is
inline-able, so the bar function calls within the foo function will be
automatically flattened as:
We can pass the -m compiler option twice to show why some functions are
inline-able but others aren't. Still using the above example:
$ go build -gcflags="-m -m" inline.go
# command-line-arguments
./inline.go:4:6: can inline bar with cost 14 as: ...
./inline.go:8:6: cannot inline foo: ... cost 96 exceeds budget
80
...
From the output, we could learn that the foo function is not inline-able, for
its inline cost is 96, which exceeds the inline threshold (80, without
enabling profile-guided optimization).
Recursive functions will never get inlined. For example, the sumSquares
function shown in the following code is not inline-able.
package main
func main() {
println(sumSquares(5))
}
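The body of sumSquares is elided in this excerpt; a recursive definition consistent with the call above might be:

```go
package main

// sumSquares returns 1*1 + 2*2 + ... + n*n, computed recursively.
// Recursive functions are never inlined by the official standard
// Go compiler.
func sumSquares(n int) int {
	if n <= 0 {
		return 0
	}
	return n*n + sumSquares(n-1)
}

func main() {
	println(sumSquares(5)) // 55
}
```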
Besides the above rules, for various reasons, currently (v1.22.n), the official
standard Go compiler never inlines functions containing:
The official standard Go compiler might change inline strategies, the inline
cost threshold and statement inline costs from version to version. For
example:
* before v1.16, the compiler thought the function g6 in the following
code was not inline-able (but since v1.16, it has become inline-able);
* before v1.18, the compiler thought the function g7 in the following
code was not inline-able (but since v1.18, it has become inline-able);
* before v1.19, the compiler thought the function g3 in the following
code was not inline-able (but since v1.19, it has become inline-able);
* before v1.20, the compiler thought the function g2 in the following
code was not inline-able (but since v1.20, it has become inline-able).
package main
func main() {
println(plusSquare(5))
}
For example, in the following code, the call to the package-level addFunc
function variable is not inline-able, because the compiler doesn't confirm
the values of package-level variables at compile time (for compilation
speed consideration). However, it does try to confirm the values of local
variables at compile time, if it is possible. So the call to local addFunc
function (variable) will get inlined.
package main
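The rest of the example is elided in this excerpt; a minimal sketch consistent with the description (using the addFunc name from the text) might be:

```go
package main

// Package-level function variable: the compiler doesn't confirm its
// value at compile time, so calls through it are not inlined.
var addFunc = func(x, y int) int { return x + y }

func main() {
	// Local function variable: the compiler can confirm its value,
	// so this call may get inlined.
	var add = func(x, y int) int { return x + y }
	println(addFunc(1, 2), add(3, 4))
}
```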
Sometimes, we might want calls to a function to never get inlined, for study
and testing purposes, or to make a caller function of the function inline-able
(see below for an example), etc. Besides the several ways introduced above,
we could also use the go:noinline comment directive to achieve this goal.
For example, the compiler will not inline the call to the add function in the
following code, even if the add function is very simple.
package main
//go:noinline
func add(x, y int) int {
return x + y
}
func main() {
println(add(1, 2))
}
However, please note that this is not a formal way to avoid inlining. It is
mainly intended to be used in standard package and Go toolchain
developments. But personally, I think this directive will be supported in a
long term.
Besides the rules introduced above, we should know that different code
implementation ways might have different inline costs, even if the code
differences are subtle. We could make use of this fact and try different
implementation ways to find out which one has the lowest inline cost.
Let's use the first example shown above again.
// inline2.go
package inline
Another example:
// sum.go
package inline
The compiler (v1.22.n) thinks the inline costs of the sumN functions are
different, which could be verified from the following outputs:
The calculated inline costs of the three functions are 25, 20 and 19,
respectively. From the above example, we could get:
(Please note that code inline costs don't reflect code execution costs. In fact,
the official standard Go compiler generates identical assembly instructions
for the above sumN functions.)
Note that, since v1.18, the official standard Go compiler thinks the inline
cost of a for-range loop is smaller than that of a plain for loop. For
example, the compiler thinks the inline cost of the following sum4 function
is 11, which is much smaller than those of the above plain for loops.
func sum4(s []int) (r int) {
for i := range s {
r += s[i]
}
return
}
// branches.go
package inline
The following outputs show that fewer branches lead to lower inline costs.
// funcvalue.go
package inline
The following outputs show that the inline cost of the function op2 is much
higher than that of the function op1 :
For example, the concat function in the following code is not inline-able,
for its inline cost is 85 (larger than the threshold 80).
func concat(bss ...[]byte) []byte { // signature elided in this excerpt; inferred from concatSlow
var m = 0
for i := 0; i < len(bss); i++ {
m += len(bss[i])
}
var r = make([]byte, 0, m)
for i := 0; i < len(bss); i++ {
r = append(r, bss[i]...)
}
return r
}
If, in practice, most cases are concatenating two byte slices, then we could
rewrite the above code as the following shown. Now the inline cost of the
concat function becomes 74 so that it is inline-able now. That means the
hot path will be always inlined.
return concatSlow(bss...)
}
//go:noinline
func concatSlow(bss ...[]byte) []byte {
if len(bss) == 0 {
return nil
} else if len(bss) == 1 {
return bss[0]
}
var m = 0
for i := 0; i < len(bss); i++ {
m += len(bss[i])
}
var r = make([]byte, 0, m)
for i := 0; i < len(bss); i++ {
r = append(r, bss[i]...)
}
return r
}
If the inline cost of the function wrapping the cold code path doesn't exceed
the inline threshold, then we should use the avoid-being-inlined ways
introduced above to prevent that function from being inline-able.
Otherwise, the rewritten concat function would still not be inline-able, for
the wrapped part would be automatically flattened into the rewritten
function. That is why the go:noinline comment directive is put before the
rewritten concatSlow function.
Please note that, currently (Go toolchain v1.22.n), the inline cost of a non-
inlined function call is 59. That means a function is not inline-able if it
contains 2+ non-inlined calls.
And please note that, since Go toolchain v1.18, if we replace the two plain
for loops within the original concat function with two for-range loops,
then the original function becomes inline-able. Here, for demo
purposes, we use two plain for loops.
An example:
package functions
import "testing"
const N = 100
import "testing"
type T [1<<8]byte
var r, s T
//go:noinline
func not_inline_able(x1, y1 *T) {
x, y := x1[:], y1[:]
for k := 0; k < len(T{}); k++ {
x[k] = y[k]
}
}
package functions
import "testing"
// The declarations of T5 and t5 are elided in this excerpt; presumably
// a five-field struct such as:
type T5 struct{ a, b, c, d, e float64 }
var t5 T5
//go:noinline
func Add5_TT_T(x, y T5) (z T5) {
z.a = x.a + y.a
z.b = x.b + y.b
z.c = x.c + y.c
z.d = x.d + y.d
z.e = x.e + y.e
return
}
//go:noinline
func Add5_PPP(z, x, y *T5) {
z.a = x.a + y.a
z.b = x.b + y.b
z.c = x.c + y.c
z.d = x.d + y.d
z.e = x.e + y.e
}
func Benchmark_Add5_TT_T(b *testing.B) {
for i := 0; i < b.N; i++ {
var x, y, z T5
z = Add5_TT_T(x, y)
t5 = z
}
}
From the above results, we get that the function Add5_PPP is more efficient
than the function Add5_TT_T .
For small-size types, the benchmark results are inverted. The reason is that
the official standard Go compiler specially optimizes some operations on
small-size values.
package functions
import "testing"
type T4 struct{a, b, c, d float32}
var t4 T4
//go:noinline
func Add4_TT_T(x, y T4) (z T4) {
z.a = x.a + y.a
z.b = x.b + y.b
z.c = x.c + y.c
z.d = x.d + y.d
return
}
//go:noinline
func Add4_PPP(z, x, y *T4) {
z.a = x.a + y.a
z.b = x.b + y.b
z.c = x.c + y.c
z.d = x.d + y.d
}
package functions
import "testing"
const N = 1<<12
var buf = make([]byte, N)
var r [128][N]byte
From the results, we can see that the function with a named result
performs slower. It looks like this is a problem related to code inlining. If
the two if b == nil {...} lines are enabled (to prevent the calls to the two
functions from being inlined), then there is no performance difference
between the two functions. Future compiler versions might remove the
performance difference when the two functions are both inline-able.
import "testing"
const N = 1<<12
var buf = make([]byte, N)
var r [128][N]byte
An example:
package functions
import "testing"
package functions
import "testing"
var n int
func inc() {
n++
}
The reason why the function g is much more performant than the function
f is that deferred calls which are not directly in loops are specially
optimized by the official standard Go compiler. The function g wraps the
code in the loop into an anonymous function call so that the deferred call is
not directly enclosed in the loop.
Please note that, the two functions are not equivalent to each other in logic.
If this is a problem, then the anonymous function call trick should not be
used.
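The functions f and g are elided in this excerpt; a minimal sketch of the trick (using the inc helper from the snippet above) might look like:

```go
package main

var n int

func inc() { n++ }

// f: the deferred calls sit directly inside the loop; they all run
// only when f returns, and this pattern is not specially optimized
// by the official standard Go compiler.
func f() {
	for i := 0; i < 100; i++ {
		defer inc()
	}
}

// g: wrapping the loop body in an anonymous function call moves the
// deferred call out of the loop, so it gets the special (open-coded)
// defer optimization; each deferred call runs at the end of its own
// iteration. Note this changes when the deferred calls run, so the
// two functions are not logically equivalent.
func g() {
	for i := 0; i < 100; i++ {
		func() {
			defer inc()
		}()
	}
}

func main() {
	f()
	g()
	println(n) // 200
}
```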
package main
import (
"log"
"testing"
)
func main() {
stat := func(f func()) int {
allocs := testing.AllocsPerRun(10, f)
return int(allocs)
}
var n = stat(func(){
debugPrint(h + w)
})
println(n) // 1
}
package main
import (
"log"
"testing"
)
var debugOn = false
func main() {
stat := func(f func()) int {
allocs := testing.AllocsPerRun(10, f)
return int(allocs)
}
var n = stat(func(){
_ = debugOn && debugPrint(h + w)
})
println(n) // 0
}
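The debugPrint function and the h and w operands are elided in this excerpt; a sketch consistent with the second example (debugPrint must return a bool to be usable with &&) might be:

```go
package main

import "log"

var debugOn = false

var h, w = "hello, ", "world" // assumed operands (elided in the original)

// debugPrint logs its argument and returns true, so it can be used
// as the right operand of &&.
func debugPrint(s string) bool {
	log.Println(s)
	return true
}

func main() {
	// Thanks to short-circuit evaluation, the right operand - and
	// therefore the string concatenation h + w and its allocation -
	// is skipped entirely when debugOn is false.
	_ = debugOn && debugPrint(h + w)
}
```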
package main
import "strconv"
return g(&x)
}
//go:noinline
func escape(x interface{}) {
sink = x
sink = nil
}
func main() {
var a = f(100)
println(a)
}
By making use of the trick introduced in the stack and escape analysis
article, we could rewrite the function f as the following shows, to prevent
the argument x from escaping to heap.
x2 := x // x2 escapes to heap
return g(&x2)
}
Interfaces
Box values into and unbox values from interfaces
An interface value could be viewed as a box to hold at most one non-
interface value. A nil interface value holds nothing. Conversely, a type
assertion could be viewed as a value unboxing operation.
package interfaces
import "testing"
var r interface{}
var s = "Go"
func Benchmark_BoxString(b *testing.B) {
for i := 0; i < b.N; i++ { r = s }
}
var x = []int{1, 2, 3}
func Benchmark_BoxSlice(b *testing.B) {
for i := 0; i < b.N; i++ { r = x }
}
var a = [100]int{}
func Benchmark_BoxArray(b *testing.B) {
for i := 0; i < b.N; i++ { r = a }
}
From the above benchmark results, we could get that each value boxing
operation generally needs one allocation, and the size of the allocated
memory block is the same as the size of the boxed value.
package interfaces
import "testing"
var r interface{}
var v0 struct{}
func Benchmark_BoxZeroSize1(b *testing.B) {
for i := 0; i < b.N; i++ { r = v0 }
}
var a0 [0]int64
func Benchmark_BoxZeroSize2(b *testing.B) {
for i := 0; i < b.N; i++ { r = a0 }
}
var aBool bool
func Benchmark_BoxBool(b *testing.B) {
for i := 0; i < b.N; i++ { r = aBool }
}
From the results, we could get that boxing zero-size values, boolean values
and 8-bit integer values doesn't make memory allocations, which is one
reason why such boxing operations are much faster.
The official standard Go compiler represents (the direct parts of) maps,
channels and functions as pointers internally, so boxing such values is also
as fast as boxing pointers.
package interfaces
import "testing"
var r interface{}
var p = new([100]int)
func Benchmark_BoxPointer(b *testing.B) {
for i := 0; i < b.N; i++ { r = p }
}
From the above results, we could get that boxing pointer values is very fast
and doesn't make memory allocations. This explains the reason why
declaring a method for *T is often more efficient than for T if we intend to
let the method implement an interface method.
By making use of this optimization, for some use cases, we could use a
lookup table to convert some non-pointer values in a small set into pointer
values. For example, in the following code, we use an array to convert
uint16 values into pointers to get much lower value boxing costs.
package interfaces
import "testing"
// The declaration of values is elided in this excerpt; it presumably
// covers all uint16 values:
var values [65536]uint16
func init() {
for i := range values {
values[i] = uint16(i)
}
}
var r interface{}
(Please note that, the results show the Box_Normal function makes zero
allocations, which is not true. The value is about 0.99 , which is truncated
to zero. The reason is boxing values within [0, 255] doesn't allocate,
which will be mentioned below.)
import "testing"
var r interface{}
const S = "Go"
func Benchmark_BoxConstString(b *testing.B) {
for i := 0; i < b.N; i++ { r = S }
}
non-constant small integer values (in range [0, 255] ) of any integer
types (except for 8-bit ones, which have been covered by the first
optimization mentioned above).
non-constant zero values of floating-point/string/slice types.
package interfaces
import "testing"
var r interface{}
package interfaces
import "testing"
var r interface{}
package interfaces
import "testing"
var r interface{}
Benchmark results:
So, if value boxing operations are made frequently on the hot paths of code
execution, it is recommended to box values with small boxing costs.
1. box the non-interface value in the first interface value then box the
non-interface value in the second one.
2. box the non-interface value in the first interface value then assign the
first interface to the second one.
If boxing the value needs an allocation, then which way is more
performant? No doubt, the second way, which could be proved by the
following benchmark code.
package main
import "testing"
var v = 9999999
var x, y interface{}
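The benchmark bodies are elided in this excerpt; a sketch of the two ways (measured with testing.AllocsPerRun rather than benchmarks, for brevity) might be:

```go
package main

import "testing"

var v = 9999999
var x, y interface{}

func main() {
	stat := func(f func()) int {
		return int(testing.AllocsPerRun(100, f))
	}
	// Way 1: box the non-interface value twice.
	println(stat(func() {
		x = v // one allocation
		y = v // another allocation
	}))
	// Way 2: box once, then assign one interface value to the other;
	// the assignment only copies the two-word interface value.
	println(stat(func() {
		x = v // one allocation
		y = x // no allocation
	}))
}
```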
package main
import (
"fmt"
"io"
"testing"
)
func main() {
stat := func(f func()) int {
allocs := testing.AllocsPerRun(100, f)
return int(allocs)
}
var x = "aaa"
var n = stat(func(){
// 3 allocations
fmt.Fprint(io.Discard, x, x, x)
})
println(n) // 3
var m = stat(func(){
var i interface{} = x // 1 allocation
// No allocations
fmt.Fprint(io.Discard, i, i, i)
})
println(m) // 1
}
package interfaces
import "testing"
//go:noinline
func (a Add) Do_NotInlined(x, y float64) float64 {
return x+y
}
From the benchmark results, we can see that the cost of the virtual table
lookup is about 2.5ns, and (for this specific case) the performance loss
caused by the call not being inlined is about 1.7ns. The losses are small for
most cases, but might be non-negligible if the code runs frequently on hot
execution paths.
This has been mentioned in the stacks and escape analysis chapter.
For example, in the standard image package, there are many At(x, y int)
color.Color and Set(x, y int, c color.Color) methods, which are
declared to implement the image/draw.Image interface. The type
color.Color is an interface type: