The journey to achieve 0 allocations


benchmark profiling allocs

Go is fast, but when doing things carefully, besides being faster is very light.

violetear’s history is not very different from many others. It is an HTTP router capable of handling any requests.

One of violetear’s main constraints is how to deal with static and dynamic URL’s. Let’s face it, if you want to replace an important component of your current architecture you should try to break as less as possible - highly trained monkeys hate changes. πŸ’

For example, the need to support this type of requests:

host.tld/static/8E605AD7-554A-450D-A72D-D0098D336E8E/127.0.0.1
        \_____/\_____________________________________________/
           |                         |
         static                   dynamic        Allow only method PUT

host.tld/static/127.0.0.1/8E605AD7-554A-450D-A72D-D0098D336E8E
        \_____/\_____________________________________________/
           |                         |
         static                   dynamic        Allow only methods GET and HEAD

Which could be represented in the router like:

uuid := `[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`
router.AddRegex(":uuid", uuid)
router.AddRegex(":ipv4", `^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$`)
router.HandleFunc("/static/:uuid/:ipv4", handleUUID, "PUT")
router.HandleFunc("/static/:ipv4/:uuid", handleIPV$, "GET,HEAD")

As if that were not enough, also with versioning support

While implementing this, the use of regular expressions was inevitable, but by using them where they were not required, they came with a big penalty.

This is how the router was splitting the path for all requests:

var splitPathRx = regexp.MustCompile(`[^/ ]+`)

// splitPath returns an slice of the path
func (r *Router) splitPath(p string) []string {
	pathParts := splitPathRx.FindAllString(p, -1)

	// root (empty slice)
	if len(pathParts) == 0 {
		pathParts = append(pathParts, "/")
	}

	return pathParts
}

pprof ΒΆ

Go comes with “batteries included” and is very easy to start profiling the code by the use of the package pprof.

The first goal was to find the bottlenecks. The following code was used for this:

package main

import (
	"net/http"
	"net/http/pprof"

	"github.com/nbari/violetear"
)

func hello(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("hello"))
}

func main() {
	r := violetear.New()
	r.AddRegex(":word", `^\w+$`)
	r.HandleFunc("/hello", hello, "GET,HEAD")
	r.HandleFunc("/hello/:word/", hello, "GET,HEAD")
	r.HandleFunc("/*", hello, "GET,HEAD")

	// Register pprof handlers
	r.HandleFunc("/debug/pprof/*", pprof.Index)
	r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	r.HandleFunc("/debug/pprof/profile", pprof.Profile)
	r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	r.HandleFunc("/debug/pprof/trace", pprof.Trace)

	http.ListenAndServe(":8080", r)
}

Notice the pprof handlers:

// Register pprof handlers
r.HandleFunc("/debug/pprof/*", pprof.Index)
r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
r.HandleFunc("/debug/pprof/profile", pprof.Profile)
r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
r.HandleFunc("/debug/pprof/trace", pprof.Trace)

wrk ΒΆ

For making some random requests, wrk + lua do a nice combo, this is the command I used for testing:

$ wrk -c100 -d10 -t2 -s counter.lua http://0:8080/

And the contents of the counter.lua script:

counter = 0

request = function()
    if counter % 100 == 0 then
	    path = "/"
    else
	    path = path .. counter .. "/"
    end
    wrk.headers["X-Counter"] = counter
    counter = counter + 1
    return wrk.format(nil, path)
end

This basically will make requests in the following way:

/
/.
/./.
/././.
/./././.
/././././.
/./././././.
/././././././.
/./././././././.
/././././././././.
...

It is very basic. This script will test the router with both static and dynamic handlers - source code for this can be found here.

Once everything is setup, the router should be up, running and listening for requests, wrk is called like mentioned before and in another terminal this command is used:

go tool pprof http://0:8080/debug/pprof/heap

After running the command you enter into an interactive mode. There I typed top and got this output:

 1(pprof) top
 2Showing nodes accounting for 4694.50kB, 100% of 4694.50kB total
 3Showing top 10 nodes out of 25
 4      flat  flat%   sum%        cum   cum%
 5 3155.84kB 67.22% 67.22%  3155.84kB 67.22%  regexp.(*bitState).reset /usr/local/Cellar/go/1.9/libexec/src/regexp/backtrack.go
 6     514kB 10.95% 78.17%      514kB 10.95%  bufio.NewWriterSize /usr/local/Cellar/go/1.9/libexec/src/bufio/bufio.go
 7  512.62kB 10.92% 89.09%   512.62kB 10.92%  regexp.(*Regexp).FindAllString.func1 /usr/local/Cellar/go/1.9/libexec/src/regexp/regexp.go
 8  512.03kB 10.91%   100%   512.03kB 10.91%  syscall.anyToSockaddr /usr/local/Cellar/go/1.9/libexec/src/syscall/syscall_bsd.go
 9         0     0%   100%  3668.46kB 78.14%  github.com/nbari/violetear.(*Router).ServeHTTP /Users/nbari/projects/go/src/github.com/nbari/violetear/violetear.go
10         0     0%   100%  3668.46kB 78.14%  github.com/nbari/violetear.(*Router).splitPath /Users/nbari/projects/go/src/github.com/nbari/violetear/violetear.go
11         0     0%   100%   512.03kB 10.91%  internal/poll.(*FD).Accept /usr/local/Cellar/go/1.9/libexec/src/internal/poll/fd_unix.go
12         0     0%   100%   512.03kB 10.91%  internal/poll.accept /usr/local/Cellar/go/1.9/libexec/src/internal/poll/sys_cloexec.go
13         0     0%   100%   512.03kB 10.91%  main.main /Users/nbari/projects/go/src/github.com/nbari/go-sandbox/profile-violetear/profile/main.go
14         0     0%   100%   512.03kB 10.91%  net.(*TCPListener).AcceptTCP /usr/local/Cellar/go/1.9/libexec/src/net/tcpsock.go

Notice lines 7 and 10

Still in the interactive mode, I typed: list splitPath which helps to show the lines of code. This was the output:

 1(pprof) list splitPath
 2Total: 2.02MB
 3ROUTINE ======================== github.com/nbari/violetear.(*Router).splitPath in /Users/nbari/projects/go/src/github.com/nbari/violetear/violetear.go
 4         0     1.02MB (flat, cum) 50.37% of Total
 5         .          .    261:   return
 6         .          .    262:}
 7         .          .    263:
 8         .          .    264:// splitPath returns an slice of the path
 9         .          .    265:func (r *Router) splitPath(p string) []string {
10         .     1.02MB    266:   pathParts := splitPathRx.FindAllString(p, -1)
11         .          .    267:
12         .          .    268:   // root (empty slice)
13         .          .    269:   if len(pathParts) == 0 {
14         .          .    270:           pathParts = append(pathParts, "/")
15         .          .    271:   }
16(pprof)

This showed that splitting the path using a regular expression was inefficient. I changed the function to something like this:

func (r *Router) splitPath(p string) []string {
        pathParts := strings.FieldsFunc(p, func(c rune) bool {
                return c == '/'
        })
        // root (empty slice)
        if len(pathParts) == 0 {
                return []string{"/"}
        }
        return pathParts
}

I created a small bench test:

package violetear

import (
	"net/http"
	"testing"
)

func benchRequest(b *testing.B, router http.Handler, r *http.Request) {
	w := &ResponseWriter{}
	u := r.URL
	rq := u.RawQuery
	r.RequestURI = u.RequestURI()
	b.ReportAllocs()
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		u.RawQuery = rq
		router.ServeHTTP(w, r)
	}
}

func BenchmarkRouterStatic(b *testing.B) {
	router := New()
	router.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {}, "GET,HEAD")
	r, _ := http.NewRequest("GET", "/hello", nil)
	benchRequest(b, router, r)
}

func BenchmarkRouterDynamic(b *testing.B) {
	router := New()
	router.AddRegex(":word", `^\w+$`)
	router.HandleFunc("/test/:word", func(w http.ResponseWriter, r *http.Request) {}, "GET,HEAD")
	r, _ := http.NewRequest("GET", "/test/foo", nil)
	benchRequest(b, router, r)
}

And run it using:

go test -run=^$ -bench=.

The output was a small win of 3 allocation less πŸŽ‰

 1$ go test -run=^$ -bench=.
 22017/10/02 18:47:23 Adding path: /hello [GET,HEAD]
 3goos: darwin
 4goarch: amd64
 5pkg: github.com/nbari/violetear
 6BenchmarkRouterStatic-4         2017/10/02 18:47:23 Adding path: /hello [GET,HEAD]
 72017/10/02 18:47:23 Adding path: /hello [GET,HEAD]
 82017/10/02 18:47:23 Adding path: /hello [GET,HEAD]
 9 2000000               692 ns/op             616 B/op         10 allocs/op
102017/10/02 18:47:25 Adding path: /test/:word [GET,HEAD]
11BenchmarkRouterDynamic-4        2017/10/02 18:47:25 Adding path: /test/:word [GET,HEAD]
122017/10/02 18:47:25 Adding path: /test/:word [GET,HEAD]
132017/10/02 18:47:25 Adding path: /test/:word [GET,HEAD]
14 1000000              1256 ns/op             936 B/op         12 allocs/op
15PASS
16ok      github.com/nbari/violetear      3.380s

Before doing the changes this was the output: 11 allocations for static routes and 14 for dynamic 🐒

 1$ go test -run=^$ -bench=.
 22017/10/02 09:56:34 Adding path: /hello [GET,HEAD]
 3goos: darwin
 4goarch: amd64
 5pkg: github.com/nbari/violetear
 6BenchmarkRouterStatic-4         2017/10/02 09:56:34 Adding path: /hello [GET,HEAD]
 72017/10/02 09:56:34 Adding path: /hello [GET,HEAD]
 82017/10/02 09:56:34 Adding path: /hello [GET,HEAD]
 91000000              1145 ns/op             776 B/op         11 allocs/op
102017/10/02 09:56:35 Adding path: /test/:word [GET,HEAD]
11BenchmarkRouterDynamic-4        2017/10/02 09:56:35 Adding path: /test/:word [GET,HEAD]
122017/10/02 09:56:35 Adding path: /test/:word [GET,HEAD]
132017/10/02 09:56:35 Adding path: /test/:word [GET,HEAD]
141000000              1835 ns/op            1096 B/op         14 allocs/op
15PASS
16ok      github.com/nbari/violetear      3.029s

To get more info about the allocations, I mainly used pprof with options like alloc_space:

go tool pprof --alloc_space  http://0:8080/debug/pprof/heap

When setting allocfreetrace=1 to environment variable GODEBUG, you can see stacktrace where allocation occurs:

allocfreetrace: setting allocfreetrace=1 causes every allocation to be
profiled and a stack trace printed on each object's allocation and free.

To build the tests:

go test -c

And to run the tests writing to a file all the output:

GODEBUG=allocfreetrace=1 ./test.test -test.run=none -test.bench=BenchmarkRouter -test.benchtime=10ms 2>trace.log

After doing all this I was available to save up to 90% of the allocations for static routes, from 11 to 1:

 1$ go test -run=^$ -bench=.
 22017/10/02 09:51:47 Adding path: /hello [GET,HEAD]
 3goos: darwin
 4goarch: amd64
 5pkg: github.com/nbari/violetear
 6BenchmarkRouterStatic-4         2017/10/02 09:51:47 Adding path: /hello [GET,HEAD]
 72017/10/02 09:51:47 Adding path: /hello [GET,HEAD]
 82017/10/02 09:51:47 Adding path: /hello [GET,HEAD]
 92017/10/02 09:51:47 Adding path: /hello [GET,HEAD]
1010000000               213 ns/op              16 B/op          1 allocs/op
112017/10/02 09:51:50 Adding path: /test/:word [GET,HEAD]
12BenchmarkRouterDynamic-4        2017/10/02 09:51:50 Adding path: /test/:word [GET,HEAD]
132017/10/02 09:51:50 Adding path: /test/:word [GET,HEAD]
142017/10/02 09:51:50 Adding path: /test/:word [GET,HEAD]
151000000              1115 ns/op             816 B/op          7 allocs/op
16PASS
17ok      github.com/nbari/violetear      3.507s

go-torch πŸ”₯ ΒΆ

To have a better graphical representation about the profiling output go-torch is an excellent tool for creating flame graphs.

The easiest way to run it is like this:

$ docker run uber/go-torch -u http://[address-of-host] -p > torch.svg

user the IP/address of your host, not localhost or 127.0.0.1

This is an example of how it was behaving the router before:

torch before

After: torch after

0 allocations

Make it happen πŸ˜ƒ

 1$ go test -run=^$ -bench=.
 22017/10/05 21:57:31 Adding path: /hello [GET,HEAD]
 3goos: darwin
 4goarch: amd64
 5pkg: github.com/nbari/violetear
 6BenchmarkRouterStatic-4         2017/10/05 21:57:31 Adding path: /hello [GET,HEAD]
 72017/10/05 21:57:31 Adding path: /hello [GET,HEAD]
 82017/10/05 21:57:31 Adding path: /hello [GET,HEAD]
 92017/10/05 21:57:32 Adding path: /hello [GET,HEAD]
1010000000               122 ns/op               0 B/op          0 allocs/op
112017/10/05 21:57:33 Adding path: /test/:word [GET,HEAD]
12BenchmarkRouterDynamic-4        2017/10/05 21:57:33 Adding path: /test/:word [GET,HEAD]
132017/10/05 21:57:33 Adding path: /test/:word [GET,HEAD]
142017/10/05 21:57:33 Adding path: /test/:word [GET,HEAD]
152017/10/05 21:57:34 Adding path: /test/:word [GET,HEAD]
16 2000000               913 ns/op             784 B/op          6 allocs/op
17PASS
18ok      github.com/nbari/violetear      4.144s

The router since always has been behaving quite well, this mainly thanks to Go itself and well, to the modest traffic it has been handling.

While doing some benchmarks and simulating some DDOS I found some patterns. All were related to the way the request.URL.Path was handled. The thing I noticed was that the bigger the request path was the more memory allocations were used. At the end was pretty much making sense since by splitting the path by / a slice was returned with all the possible parts of the path.

For example, for a request like:

/the/quick/brown/fox/jumps/over/the/lazy/dog

The router was creating a slice of 9 items:

[ the quick brown fox jumps over the lazy dog]

In many cases, this would never be defined as routes and could probably just be served by a catch-all(*) handler.

To fix this, instead of splitting and returning a slice, the path is now parsed. Each component of the path from left to right is used to find within the set of defined handlers; so for the same request shown before, the router converts it to:

key := "the"
path := "/quick/brown/fox/jumps/over/the/lazy/dog"

It searches for a possible handler named the and if found then next will try again using quick and shorting the path to /brown/fox/jumps/over/the/lazy/dog:

key := "quick"
path := "/brown/fox/jumps/over/the/lazy/dog"

key := "brown"
path := "/fox/jumps/over/the/lazy/dog"

key := "fox"
path := "/jumps/over/the/lazy/dog"

And so on until a handler matches the key or a 404 is returned.

This by principle made the router more efficient but now the challenge was how to parse the path without creating allocations.

The tricky part was to avoid concatenating strings runes or bytes, by just using the existing path and return chunks of the slice did the trick, the current code look like is this:

 1func (t *Trie) SplitPath(path string) (string, string) {
 2	var key string
 3	if path == "" {
 4		return key, path
 5	} else if path == "/" {
 6		return path, ""
 7	}
 8	for i := 0; i < len(path); i++ {
 9		if path[i] == '/' {
10			if i == 0 {
11				return t.SplitPath(path[1:])
12			}
13			if i > 0 {
14				key = path[:i]
15				path = path[i:]
16				if path == "/" {
17					return key, ""
18				}
19				return key, path
20			}
21		}
22	}
23	return path, ""
24}

The router now is faster and static routes are 0% fat πŸš€.

There is still work in progress - who would think that behind a simple router there is too much logic involved. The experience so far has been very pleasant, learning and improving every day more about Go and its niceties.

To make an even better improvement, any tips, comments or a review will be more than welcome πŸ™

comments powered by Disqus