The first rule of cloud computing should be: Always have a health check!
Why? Well - without them your cluster will not know if the application is actually up, still starting or terminating, or anywhere in between. As long as there are readinessProbes, Kubernetes can make sure no traffic gets routed to your app before it is really ready. And even more important: It will restart services and reschedule them once your health checks start going sideways.
But here is another insight into health checks: Do performance testing on them.
During the last couple of days I've had Kubernetes kill and restart perfectly healthy Kong API gateway pods because apparently the /status route in Kong does some pretty expensive queries on the backend. Kong apparently thinks it's cool to do a SELECT COUNT(*) on most of its tables to tell you how many consumers it has registered, how many oauth_tokens there are, etc. All totally irrelevant information for a health check - but it's still the only endpoint I was able to hit that would actually terminate on Kong itself (a probe on any proxied route would also get Kong killed whenever the upstream service is having a problem). And /status sounded like a reasonable endpoint for health-checking.
Now with Postgres that kind of query would not really be a terrible problem (still not good), but for Cassandra it's pretty catastrophic, since it's not really meant to do aggregation queries without a partition key. Looking at the code reveals the problem - and so once there was some moderate pressure, the slow queries would time out and Kubernetes would think the Kong pod was dead (although it was still serving requests) and kill it. Yay!
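A rough way to catch this kind of problem before Kubernetes does is to time the health endpoint yourself, ideally while the system is under some load, and compare the result with the probe timeout. The following is only a minimal sketch: the function name time_health_check is made up for this post, and it assumes the httpGet rules of "status between 200 and 399, answered within timeoutSeconds" (the Kubernetes default timeout is 1 second).

```python
import time
import urllib.error
import urllib.request


def time_health_check(url, timeout_seconds=1.0, attempts=20):
    """Hit `url` repeatedly and count how often it would fail a probe.

    Hypothetical helper, not part of Kubernetes or Kong. A request counts
    as a failure if it errors out, returns a status outside 200-399, or
    takes longer than `timeout_seconds` in total.
    """
    latencies = []
    failures = 0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
                resp.read()
                ok = 200 <= resp.status < 400
        except (urllib.error.URLError, OSError):
            # Covers connection errors, HTTP 4xx/5xx and socket timeouts.
            ok = False
        elapsed = time.monotonic() - start
        latencies.append(elapsed)
        # The socket timeout applies per operation, so also check the total.
        if not ok or elapsed > timeout_seconds:
            failures += 1
    return {
        "attempts": attempts,
        "failures": failures,
        "slowest_seconds": max(latencies),
    }
```

Pointing this at something like http://localhost:8001/status (assuming the Kong admin API is reachable there) while the database is busy should show fairly quickly whether the endpoint stays under the probe timeout.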
So the solution here was to move away from an httpGet liveness & readinessProbe to an exec probe. Exec probes are one of my favorite Kubernetes features - instead of doing network calls to check if something is up, it will just do a docker exec and determine based on the return code of the program executed if the pod is healthy or not.
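To make the exit-code rule concrete, here is a small sketch of the decision an exec probe boils down to: run a command, treat exit code 0 as healthy and anything else (including a timeout) as unhealthy. This is an illustration only, not how the kubelet is implemented, and exec_probe is a hypothetical name.

```python
import subprocess
import sys


def exec_probe(command, timeout_seconds=1.0):
    """Return True if `command` exits with status 0 within the timeout.

    Hypothetical helper that mimics the exec probe rule:
    exit code 0 means healthy, everything else means unhealthy.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Command hung past the timeout or could not be started at all.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in command; inside a Kong container this would be the health CLI.
    healthy = exec_probe([sys.executable, "-c", "raise SystemExit(0)"], 10.0)
    print("healthy" if healthy else "unhealthy")
```

The nice part of this model is that the check never touches the network or the database unless the command itself decides to.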
And coincidentally Kong comes with a command-line utility called kong health that does exactly what it's named for - and is lightning fast with no database involved :).
Here is the relevant YAML configuration: