doujiu8145 2019-07-16 04:04
Views: 60

Using only the Go standard library, why do I keep leaking TCP connections in this two-layer architecture?

I'm using only the Go standard library here, most importantly net/http.

The application consists of two layers. The first layer is the basic web application: it serves the UI and proxies a number of API calls to the second layer based on username. Effectively it is a load balancer with consistent hashing: each user is allocated to one second-layer node, and any request pertaining to that user must be sent to that particular node.

Quick details

The API endpoints in the first layer read in a JSON body, check the username, use it to work out which layer 2 node should receive the body, and forward it there. This is done using a global http.Client that has timeouts set on it.
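
To make the username-to-node routing concrete, here is a minimal sketch of how a user could be pinned to one node. It uses rendezvous (highest-random-weight) hashing as a stand-in; it is not the actual hashing code from the application, and pickNode and the node addresses are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickNode returns the layer 2 node for a username using rendezvous
// (highest-random-weight) hashing: every node is scored against the
// username and the highest score wins. It returns "" if nodes is empty.
func pickNode(username string, nodes []string) string {
	var best string
	var bestScore uint64
	for _, n := range nodes {
		h := fnv.New64a()
		h.Write([]byte(n))
		h.Write([]byte{0}) // separator between node and username
		h.Write([]byte(username))
		if s := h.Sum64(); best == "" || s > bestScore {
			best, bestScore = n, s
		}
	}
	return best
}

func main() {
	// Hypothetical layer 2 addresses, for illustration only.
	nodes := []string{"10.0.0.1:8100", "10.0.0.2:8100", "10.0.0.3:8100"}
	fmt.Println(pickNode("alice", nodes))
}
```

With this scheme, removing a node only moves the users that were assigned to it; everyone else keeps their node.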

The server side (layer 2) does a defer request.Body.Close() in each handler, after checking that the decoder.Decode(&obj) call that unmarshals the JSON returned no error. If there is a code path where the body is not closed, it isn't one that is likely to be followed very often.
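
For reference, a minimal sketch of the shape of such a layer 2 handler. The balanceRequest type, its field, and the statusFor helper are hypothetical; the real handlers differ. The defer is placed before the decode here so the error path is covered too. Note that, per the net/http documentation, the server closes the request body itself once the handler returns, so this close is belt and braces on the server side.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// balanceRequest is a hypothetical payload; the real fields differ.
type balanceRequest struct {
	Username string `json:"username"`
}

// decodeRequest unmarshals one JSON object from r.
func decodeRequest(r io.Reader) (balanceRequest, error) {
	var obj balanceRequest
	err := json.NewDecoder(r).Decode(&obj)
	return obj, err
}

// handleBalances decodes the JSON body and echoes the username back.
func handleBalances(w http.ResponseWriter, r *http.Request) {
	// Deferred before decoding so it also runs when decoding fails.
	defer r.Body.Close()

	obj, err := decodeRequest(r.Body)
	if err != nil {
		http.Error(w, "bad JSON", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"username": obj.Username})
}

// statusFor runs the handler in-process and returns the status code.
func statusFor(body string) int {
	req := httptest.NewRequest("POST", "/pms/all-venue-balances", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handleBalances(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(`{"username":"alice"}`), statusFor("not json"))
}
```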

Symptoms

On the second-layer node (the application server) I get log lines like the following, presumably because it is leaking sockets and exhausting its file descriptors:

2019/07/15 16:16:59 http: Accept error: accept tcp [::]:8100: accept4: too many open files; retrying in 1s
2019/07/15 16:17:00 http: Accept error: accept tcp [::]:8100: accept4: too many open files; retrying in 1s

When I run lsof, it outputs about 14k lines, of which 11,200 are TCP sockets. Nearly all of these TCP sockets are in connection state CLOSE_WAIT, and they are connections between my application server (the second-layer node) and the web server (the first-layer node).
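
CLOSE_WAIT means the peer has closed its end and the local process has not yet closed the socket, so it is the layer 2 process that is holding these. As one thing to rule out, here is a minimal sketch of a layer 2 server built with explicit timeouts instead of the zero-value defaults. The newAppServer name and all timeout values are assumptions, not what the application currently runs. These settings bound reads, writes, and idle keep-alive time on a connection, but they would not by themselves interrupt a handler that is blocked on something else.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newAppServer builds an http.Server with explicit timeouts.
// The durations are illustrative guesses, not tuned values.
func newAppServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 5 * time.Second,  // time allowed to read request headers
		ReadTimeout:       30 * time.Second, // time allowed to read the whole request
		WriteTimeout:      30 * time.Second, // time allowed to write the response
		IdleTimeout:       60 * time.Second, // how long an idle keep-alive connection is kept
	}
}

func main() {
	srv := newAppServer(":8100", http.NewServeMux())
	fmt.Println(srv.Addr, srv.IdleTimeout)
	// srv.ListenAndServe() would start serving; omitted so the sketch exits.
}
```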

Interestingly, nothing seems to go wrong with the web application server (layer 1) during this timeframe.

Why does this happen?

I've seen lots of explanations, but most either say that you need to configure a custom http.Client with explicit settings instead of using the default one, or tell you to make sure to close the request bodies after reading from them in the layer 2 handlers.

Given all this information, does anyone have any idea what I can do to put this to bed once and for all? Everything I find online points to user error, and while I certainly hope that's the case here, I worry that I've already nailed down every quirk of the Go standard library I can find.

I've been having trouble pinning down exactly how long it takes for this to happen. The last time, the process had been up for 3 days before I started to see this error, and at that point nothing recovers until I kill and restart the process.

Any help would be hugely appreciated!

EDIT: example of client-side code

Here is an example of what I'm doing in the web application (layer 1) to call the layer 2 node:


var webHttpClient = &http.Client{
    Transport: &http.Transport{
        MaxIdleConnsPerHost: MaxIdleConnections,
    },
    Timeout: time.Second * 20,
}
// ...
                    uri := fmt.Sprintf("http://%s/%s", tsUri, "pms/all-venue-balances")
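                    req, e := http.NewRequest("POST", uri, bytes.NewBuffer(b))
                    if e != nil {
                        // e must be used for this excerpt to compile
                        log.Printf("Submit rebal error building request: %v\n", e)
                        w.WriteHeader(500)
                        return
                    }
                    resp, err := webHttpClient.Do(req)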
                    if err != nil {
                        log.Printf("Submit rebal error 3: %v\n", err)
                        w.WriteHeader(500)
                        return
                    }
                    defer resp.Body.Close()

                    body, _ := ioutil.ReadAll(resp.Body)
                    w.WriteHeader(200)
                    w.Write(body)