In this situation, I'm using all standard Go libraries -- net/http, most importantly.
The application consists of two layers. The first layer is the basic web application. It serves the UI and proxies a set of API calls back to the second layer based on username. So it's effectively a load balancer with consistent hashing: each user is allocated to one of the second-layer nodes, and any request pertaining to that user must be sent to that particular node.
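To make the routing step concrete, here is a minimal sketch of a consistent-hash ring that maps a username to a layer 2 node. It is only an illustration, not the code from my application: the `ring` type, `newRing`, `nodeFor`, the FNV hash, the replica count and the node addresses are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring. All names here are hypothetical.
type ring struct {
	points []uint32          // sorted hash positions on the ring
	owner  map[uint32]string // hash position -> node address
}

// hashKey maps a string to a position on the ring using FNV-1a.
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places each node at several virtual positions to even out load.
func newRing(nodes []string, replicas int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < replicas; i++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, i))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// nodeFor returns the node owning the first ring position at or after the
// hash of username, wrapping around to the start of the ring.
func (r *ring) nodeFor(username string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(username)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	// Assumed node addresses; only the port matches the log lines in this post.
	r := newRing([]string{"10.0.0.1:8100", "10.0.0.2:8100", "10.0.0.3:8100"}, 50)
	for _, user := range []string{"alice", "bob"} {
		fmt.Printf("%s -> %s\n", user, r.nodeFor(user))
	}
}
```

The property that matters for this question is that the same username always resolves to the same node, so all of a user's requests go over connections to that one node.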
Quick details
These API endpoints in the first layer effectively read in a JSON body, check the username, use that to figure out which of the layer 2 nodes should receive the JSON body, and then forward it there. This is done using a global http.Client that has timeouts set on it, as appropriate.
The server side does a defer request.Body.Close() in each of the handlers, after checking that no error comes back from the decoder.Decode(&obj) call that unmarshals the JSON. If there is any code path where the body doesn't get closed, it isn't one that's likely to be followed very often.
Symptoms
On the node in the second layer (the application server), I get log lines like the ones below, presumably because it's leaking sockets and using up all the file descriptors:
2019/07/15 16:16:59 http: Accept error: accept tcp [::]:8100: accept4: too many open files; retrying in 1s
2019/07/15 16:17:00 http: Accept error: accept tcp [::]:8100: accept4: too many open files; retrying in 1s
And when I run lsof, about 14k lines are output, of which 11,200 are TCP sockets. When I look through the lsof output, I see that nearly all of these TCP sockets are in connection state CLOSE_WAIT, and are between my application server (the second-layer node) and the web server (the first-layer node).
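As a way to track this over time without running lsof by hand, the count can also be read from the kernel's socket tables. The sketch below is Linux-only and assumes the usual /proc/net/tcp layout, where the fourth column is the socket state in hex and 08 means CLOSE_WAIT. The function name `countState` is hypothetical, and `os.ReadFile` needs Go 1.16 or later.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// closeWait is the hex state code Linux uses for CLOSE_WAIT in /proc/net/tcp.
const closeWait = "08"

// countState counts sockets in the given state in /proc/net/tcp-style text.
func countState(table, state string) int {
	n := 0
	for i, line := range strings.Split(table, "\n") {
		fields := strings.Fields(line)
		if i == 0 || len(fields) < 4 {
			continue // skip the header row and blank or short lines
		}
		if fields[3] == state {
			n++
		}
	}
	return n
}

func main() {
	total := 0
	// The server listens on [::]:8100, so its sockets may appear in tcp6.
	for _, path := range []string{"/proc/net/tcp", "/proc/net/tcp6"} {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Println("skipping", path, "-", err)
			continue
		}
		total += countState(string(data), closeWait)
	}
	fmt.Println("CLOSE_WAIT sockets:", total)
}
```

Unlike lsof on a single process, these tables cover every socket in the network namespace, so the number is an upper bound for the application server's own sockets.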
Interestingly, nothing seems to go wrong with the web application server (layer 1) during this timeframe.
Why does this happen?
I've seen lots of explanations, but most either point out that you need to set explicit timeouts on a custom http.Client rather than use the default client, or they tell you to make sure to close the request bodies after reading from them in the layer 2 handlers.
Given all this information, does anyone have any idea what I can do to at least put this to bed once and for all? Everything I search on the internet is user error, and while I certainly hope that's the case here, I worry that I've nailed down every last quirk of the Go standard library I can find.
I've been having trouble nailing down exactly how long it takes for this to happen -- the last time it happened, the process had been up for 3 days before I started to see this error, and at that point obviously nothing recovers until I kill and restart the process.
Any help would be hugely appreciated!
EDIT: example of client-side code
Here is an example of what I'm doing in the web application (layer 1) to call the layer 2 node:
// Shared client for all calls from layer 1 to layer 2.
var webHttpClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConnsPerHost: MaxIdleConnections,
	},
	Timeout: time.Second * 20,
}

// ...

// Inside the handler: forward the JSON body b to the layer 2 node at tsUri.
uri := fmt.Sprintf("http://%s/%s", tsUri, "pms/all-venue-balances")
req, e := http.NewRequest("POST", uri, bytes.NewBuffer(b))
if e != nil {
	// Building the request failed; nothing was sent.
	w.WriteHeader(500)
	return
}
resp, err := webHttpClient.Do(req)
if err != nil {
	log.Printf("Submit rebal error 3: %v\n", err)
	w.WriteHeader(500)
	return
}
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
w.WriteHeader(200)
w.Write(body)