Tigraine

Daniel Hoelbling-Inzko talks about programming

A bulkhead in Go is really just a semaphore

When looking to build a software system that's resilient in the face of failure there are a bunch of useful concepts and components that all need to work together to achieve that goal.

One of these tools is Bulkheading.

Bulkheads in traditional shipbuilding are a means to keep water that's entering the vessel in one compartment from flooding the whole ship and sinking it. Translated to software it's pretty similar: You try to compartmentalise the application so failures in one part don't adversely affect the rest of the application.

A classic example of why this is important would be a database that's acting up and starts responding slowly to queries.

By itself that would not be a problem - a slow request would run into a timeout and the application would gracefully handle that down the line. It does become a problem though if the clients continue hammering that service with more and more queries while the database is slow. The slow responses end up blocking resources in the application, and given high enough timeouts and enough incoming requests there is a real risk of the application running out of resources and crashing.

The other issue in such a scenario is that once the database starts becoming unstable/slow, adding more queries just equates to kicking someone that's already down. There is a high chance that the added queries will just make matters worse and cause a struggling database to shut down completely.

The solution to this is to introduce a maximum number of concurrent requests that the application is allowed to send to the database. Once the DB starts getting slow the incoming requests are not immediately submitted to the DB but actually have to wait until another active request is done. By putting a maximum wait time on this you can essentially limit the number of in-flight requests to a known quantity that will prevent your service from consuming all available resources and crashing. And you get to degrade the service gracefully.

Why not use a normal timeout? Timeouts are a static upper bound while latency is rarely uniform. An operation that normally responds in anywhere between 5ms and 10s will usually call for a timeout of 15-20 seconds, depending on how generous you are. With a 20 second timeout and a quite moderate 300 operations per second you can end up with a respectable 6,000 in-flight requests tying up resources in your application once every call runs into that timeout. In Java-Land that would already spell doom for your application's threadpools. So in addition to maximum duration timeouts we need something more - and that something is a Bulkhead.
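
The number above is simply throughput multiplied by the longest time a request can stay open. A minimal sketch of that arithmetic in Go, where maxInFlight is a hypothetical helper made up for illustration and the inputs are the figures from the paragraph above:

```go
package main

import (
	"fmt"
	"time"
)

// maxInFlight is a hypothetical helper: it estimates how many requests can be
// open at the same time if every request takes the full timeout to finish.
func maxInFlight(opsPerSecond int, timeout time.Duration) int {
	return int(float64(opsPerSecond) * timeout.Seconds())
}

func main() {
	// 300 operations per second, each one held open for 20 seconds.
	fmt.Println(maxInFlight(300, 20*time.Second))
}
```

A bulkhead replaces that product with a fixed number you choose yourself, independent of how slow the collaborator gets.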

After having used the excellent Resilience4J library in Java to "failure-proof" a service that had spotty collaborators, we moved on to do the same for some Go services. We expected to find a lot of libraries providing Bulkheading, but we couldn't really find one that's maintained and confidence-inspiring.

So we looked at alternatives. Remembering that a Bulkhead isn't anything super fancy, we looked at what the Go team itself provides and hit gold in the golang.org/x/sync/semaphore package. Strictly speaking it is not part of the standard library but one of the supplementary golang.org/x modules maintained by the Go project. Specifically the Weighted semaphore implementation is essentially all you need for a Bulkhead. A bulkhead in Go is simply a Semaphore, with all the relevant timeout features being enabled by the clever use of the context package. It doesn't come with monitoring out of the box like maybe Resilience4J does - but that's easy to layer on top and the API ends up being very simple:

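// imports needed: context, time, golang.org/x/sync/semaphore

sem := semaphore.NewWeighted(5) // allow 5 concurrent calls
go func() {
        // wait at most 1 second for a free slot
        ctx, cancel := context.WithTimeout(context.TODO(), 1*time.Second)
        defer cancel() // always release the timer behind the context

        // Acquire the semaphore
        if err := sem.Acquire(ctx, 1); err != nil {
                // bulkhead is full and we timed out
                return
        }
        defer sem.Release(1)

        // do work
}()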

As you can see, since the semaphore supports context we can very easily add our maximum waiting time for the bulkhead via context.WithTimeout. We've essentially implemented a Bulkhead with nothing but a package from golang.org/x/sync and quite straightforward, idiomatic Go syntax.
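
To illustrate how monitoring could be layered on top, here is a minimal sketch of a reusable wrapper that counts rejected calls. It is not the code we run in production: Bulkhead, NewBulkhead, Execute, Rejected, ErrBulkheadFull and runDemo are hypothetical names, and a buffered channel is swapped in for semaphore.Weighted so the example runs with the standard library alone.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// ErrBulkheadFull is returned when no slot became free within the wait time
// (or the caller's context ended while waiting).
var ErrBulkheadFull = errors.New("bulkhead full")

// Bulkhead is a hypothetical wrapper. A buffered channel stands in for
// semaphore.Weighted: one buffered element equals one call in flight.
type Bulkhead struct {
	slots    chan struct{}
	maxWait  time.Duration
	rejected atomic.Int64
}

func NewBulkhead(maxConcurrent int, maxWait time.Duration) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, maxConcurrent), maxWait: maxWait}
}

// Execute runs fn once a slot is free and gives up after maxWait.
func (b *Bulkhead) Execute(ctx context.Context, fn func(context.Context) error) error {
	waitCtx, cancel := context.WithTimeout(ctx, b.maxWait)
	defer cancel()

	select {
	case b.slots <- struct{}{}: // acquired a slot
	case <-waitCtx.Done():
		b.rejected.Add(1) // the "monitoring": count calls that were turned away
		return ErrBulkheadFull
	}
	defer func() { <-b.slots }() // release the slot

	return fn(ctx)
}

// Rejected reports how many calls were turned away so far.
func (b *Bulkhead) Rejected() int64 { return b.rejected.Load() }

// runDemo fires 5 simultaneous calls at a bulkhead with 2 slots.
func runDemo() (accepted, rejected int64) {
	b := NewBulkhead(2, 30*time.Millisecond)
	var wg sync.WaitGroup
	var ok atomic.Int64
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := b.Execute(context.Background(), func(context.Context) error {
				time.Sleep(300 * time.Millisecond) // simulated slow database call
				return nil
			})
			if err == nil {
				ok.Add(1)
			}
		}()
	}
	wg.Wait()
	return ok.Load(), b.Rejected()
}

func main() {
	accepted, rejected := runDemo()
	fmt.Printf("accepted=%d rejected=%d\n", accepted, rejected)
}
```

With two slots and five simultaneous callers, two calls run and the other three are rejected after the 30ms wait. The rejection counter is the kind of number you would export to your metrics system.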

Filed under go, resilience

Debugging Go IOWait Hang: Sometimes it's really not your code

If something looks like a bug in the Language Runtime, Standard Library or the Operating System I tend to always approach it with caution: It's usually a bug in my code and I'm just not seeing it.

But sometimes it's not me - it's really the compiler and you spend a solid week debugging a Go program until you find out that cross-compiling from OSX to Linux leads to a stdlib Bug that manifests itself with the whole application just hanging in IOWait loops given enough concurrency.

Obviously the whole thing was really frustrating because:

  • The bug only happened on production servers (obviously - anything else would not be fun).
  • Could only be reproduced on a large dataset of 300 million items (so every test also takes quite a while)
  • I had to test if it works without concurrency (which took 2 days and yes it did)

But the important finding from this exercise was that you can print the full stacktrace of all running Goroutines as well as their status for a running/hanging program! You just have to send SIGABRT to the process, for example with kill -ABRT followed by the process ID. The Go runtime then exits with a stack dump, similar to what you see when a panic occurs. That was massively helpful in hunting down this bug. Kudos to the Go team for that.

An example for this:

package main

func main() {
  for {}
}

The program will obviously hang in a busy loop, but if you send it SIGABRT via kill -ABRT you'll get something similar to this printed to stderr:

SIGABRT: abort
PC=0x1056d70 m=0 sigcode=0

goroutine 1 [running]:
main.main()
        /Users/tigraine/projects/test/main.go:4 fp=0xc00003c788 sp=0xc00003c780 pc=0x1056d70
runtime.main()
        /usr/local/Cellar/go/1.14.1/libexec/src/runtime/proc.go:203 +0x212 fp=0xc00003c7e0 sp=0xc00003c788 pc=0x102b3f2
runtime.goexit()
        /usr/local/Cellar/go/1.14.1/libexec/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00003c7e8 sp=0xc00003c7e0 pc=0x10528f1
...
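
If killing the process is not an option, a similar dump can be produced from inside the program. A minimal sketch using runtime.Stack from the standard library; allStacks is a hypothetical helper name, and hooking it up to a signal handler or debug endpoint is left out here:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// allStacks is a hypothetical helper that returns the stack traces of all
// goroutines as text, without terminating the program.
func allStacks() string {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true) // true = include every goroutine
		if n < len(buf) {
			return string(buf[:n])
		}
		buf = make([]byte, 2*len(buf)) // dump was truncated, retry with more room
	}
}

func main() {
	block := make(chan struct{})
	go func() { <-block }() // a second goroutine so the dump shows more than main
	fmt.Fprint(os.Stderr, allStacks())
}
```
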
Filed under golang, go, debugging

Golang hidden gems: testing.T.Log

One thing I love about Go is its build chain and overall ease of use. Some things take time to get used to, but the lightning fast builds and the convention-based testing Go offers are addicting right from the start.

Today I found another hidden Gem I think is just genius: testing.T.Log(). Ok I admit, not the most sexy method to get excited about - but bear with me for a moment. Imagine the following code.

func TestSomething(t *testing.T) {
  t.Log("Hello World")
}

What's the output? If you'd expect Hello World you are mistaken. With a plain go test and a passing test, the output is exactly nothing :)

testing.T.Log() only prints something if the test failed, for example through testing.T.Error or testing.T.Fatal, or if you explicitly ask for it with go test -v. Brilliant! Nothing is more annoying than chatty test suites where your actual problem is buried in 2-3 megabytes of meaningless debug statements! And this solves the problem really elegantly. You can log as much debug info as you want and by default it will only surface if the test actually failed.

Filed under golang, go, testing
