After much search, I have finally come to an acceptable and logical solution to this problem.
The root-cause is this: The underlying TCP connection is closed abruptly, but neither the gRPC Client nor Server are 'notified' of this event.
The challenge is at multiple levels:
- Kernel's management of TCP sockets
- Any intermediary load-balancers/reverse-proxies (by Cloud Providers or otherwise) and how they manage TCP sockets
- Your application layer itself and it's networking requirements - whether it can reuse the same connection for future requests not
My solution turned out to be fairly simple:
server = grpc.NewServer(
MaxConnectionIdle: 5 * time.Minute, // <--- This fixes it!
This ensures that the gRPC server closes the underlying TCP socket gracefully itself before any abrupt kills from the kernel or intermediary servers (AWS and Google Cloud Load Balancers both have larger timeouts than 5 minutes).
The added bonus you will find here is also that any places where you're using multiple connections, any leaks introduced by clients that forget to Close the connection will also not affect your server.
My $0.02: Don't blindly trust any organisation's (even Google's) ability to design and maintain API. This is a classic case of defaults-gone-wrong.