When looking to build a software system that's resilient in the face of failure, there are a bunch of useful concepts and components that all need to work together to achieve that goal.
One of these tools is Bulkheading.
Bulkheads in traditional shipbuilding are a means to keep water that's entering the vessel in one compartment from flooding the whole ship and sinking it. Translated to software it's pretty similar: You try to compartmentalise the application so failures in one part don't adversely affect the rest of the application.
A classic example of why this is important would be a database that's acting up and starts responding slowly to queries.
By itself that would not be a problem - a slow request would run into a timeout and the application would handle that gracefully down the line.
It does become a problem, though, if clients keep hammering the service with more and more requests while the database is slow. The slow responses end up blocking resources in the application, and given high enough timeouts and enough incoming requests there is a real risk of the application running out of resources and crashing.
The other issue in such a scenario is that once the database starts becoming unstable/slow, adding more queries just equates to kicking someone that's already down. There is a high chance that the added queries will just make matters worse and cause a struggling database to shut down completely.
The solution to this is to introduce a maximum number of concurrent requests that the application is allowed to send to the database. Once the DB starts getting slow, incoming requests are not immediately submitted to it but have to wait until another active request is done. By also putting a maximum wait time on that queueing you limit the number of in-flight requests to a known quantity, which prevents your service from consuming all available resources and crashing - and you get to degrade the service gracefully instead.
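To make that concrete, here is a minimal sketch of the idea using nothing but a buffered channel: the channel's capacity is the maximum number of concurrent calls, and the select enforces the maximum wait time. The names slots, queryWithBulkhead, queryDB and maxWait are made up for illustration - queryDB simply stands in for the real database call.

import (
	"context"
	"errors"
	"time"
)

// slots is the bulkhead: at most 5 calls may be in flight at any time.
var slots = make(chan struct{}, 5)

// queryDB stands in for the real database call.
func queryDB(ctx context.Context) error { return nil }

// queryWithBulkhead waits at most maxWait for a free slot before giving up.
func queryWithBulkhead(ctx context.Context, maxWait time.Duration) error {
	select {
	case slots <- struct{}{}: // got a slot, we may proceed
		defer func() { <-slots }() // free the slot once the call is done
		return queryDB(ctx)
	case <-time.After(maxWait): // waited too long - fail fast instead of piling up
		return errors.New("bulkhead full")
	}
}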
Why not use a normal timeout? Timeouts are a static upper bound, while latency is rarely uniform. Putting a timeout on an operation that normally responds in anywhere between 5ms and 10s will usually call for a timeout of 15-20 seconds, depending on how generous you are. With a 20 second timeout and a quite moderate 300 operations per second you end up with a respectable 6,000 in-flight requests tying up resources in your application. In Java land that would already spell doom for your application's thread pools. So in addition to maximum duration timeouts we need something more - and that something is a Bulkhead.
After having used the excellent Resilience4J library in Java to "failure-proof" a service with spotty collaborators, we moved on to some Go services to do the same. We expected to find plenty of libraries providing Bulkheading, but we couldn't really find one that's maintained and confidence-inspiring.
So we looked at alternatives. Remembering that a Bulkhead isn't anything super fancy, we looked at Go's standard library and its golang.org/x companion modules and hit gold in the golang.org/x/sync/semaphore package. Specifically, the Weighted semaphore implementation is essentially all you need: a Bulkhead in Go is simply a semaphore, with all the relevant timeout features enabled by clever use of the context package. It doesn't come with monitoring out of the box like Resilience4J does - but that's easy to layer on top, and the API ends up being very simple:
sem := semaphore.NewWeighted(5) // allow 5 concurrent calls

go func() {
	// wait at most 1 second for a free slot in the bulkhead
	ctx, cancel := context.WithTimeout(context.TODO(), 1*time.Second)
	defer cancel()

	// Acquire the semaphore
	if err := sem.Acquire(ctx, 1); err != nil {
		// bulkhead is full and we timed out
		return
	}
	defer sem.Release(1)

	// do work
}()
As you can see, since semaphore supports context we can very easily add our maximum waiting time for the bulkhead via context.WithTimeout - and with that we've essentially implemented a Bulkhead using nothing but golang.org/x/sync and quite straightforward, idiomatic Go.
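If you want to reuse this across call sites (and hang monitoring off it, as mentioned above), it's easy to wrap the semaphore in a small helper. The Bulkhead type, NewBulkhead and Execute below are illustrative names, not part of any library - just one way such a wrapper could look:

import (
	"context"
	"time"

	"golang.org/x/sync/semaphore"
)

// Bulkhead limits concurrent calls and how long callers wait for a free slot.
type Bulkhead struct {
	sem     *semaphore.Weighted
	maxWait time.Duration
}

func NewBulkhead(maxConcurrent int64, maxWait time.Duration) *Bulkhead {
	return &Bulkhead{sem: semaphore.NewWeighted(maxConcurrent), maxWait: maxWait}
}

// Execute runs fn if a slot frees up within maxWait, otherwise it returns
// the acquisition error - a natural place to hook in counters or logs.
func (b *Bulkhead) Execute(ctx context.Context, fn func(context.Context) error) error {
	// Limit only how long we wait for a slot, not how long fn may run.
	acquireCtx, cancel := context.WithTimeout(ctx, b.maxWait)
	defer cancel()
	if err := b.sem.Acquire(acquireCtx, 1); err != nil {
		// bulkhead full: count a rejection here if you want metrics
		return err
	}
	defer b.sem.Release(1)
	return fn(ctx)
}

A call site then boils down to something like bh := NewBulkhead(5, 1*time.Second) followed by err := bh.Execute(ctx, runQuery) (with runQuery being whatever does the actual work), which keeps the bulkhead concern neatly out of the business logic.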