weixin_39886841
2021-01-03 10:07

Frequently seeing "connection timed out" errors

Hey Ryan,

Seeing these error messages (below) every few minutes in the monstache logs:

    ERROR 2019/01/17 18:35:57 Unable to dial MongoDB: dial tcp <ip>:27017: connect: connection timed out
    ERROR 2019/01/17 18:35:57 Unable to dial MongoDB: dial tcp <ip>:27017: connect: connection timed out

My connection settings are:

    [mongo-dial-settings]
    ssl = true
    read-timeout = 300
    write-timeout = 300

    [mongo-session-settings]
    socket-timeout = 300
    sync-timeout = 300

Ingestion in production is also noticeably slower than in our dev tests. In roughly 14 hours, only about 1.2 million documents have been transferred from MongoDB to ES, whereas in dev it only took a few hours to transfer 2.5+ million documents. The only difference between the tests was adding a namespace-regex in prod. Is filtering expensive performance-wise?

Thanks!

This question comes from the open source project: rwynn/monstache


35 replies

  • weixin_39886841 4 months ago

    Have you tested monstache with MongoDB 4.0 and above? Looking at the mongo library you're using (github.com/globalsign/mgo), support for 4.0 is experimental, so I'm not sure if that's playing a role here (though somehow dev doesn't seem to be having any problems with this setup).

  • weixin_39886841 4 months ago

    Looking at the IP addresses of the timed-out connections, neither of those hosts is the primary, so why doesn't monstache keep ingesting from the primary?

  • weixin_39778447 4 months ago

    I've tested briefly on MongoDB 4.0 but not in a production environment (monstache running for long periods). Do you happen to use the change-stream-namespaces option in monstache? I've noticed problems with that in my testing and I'm currently putting together a pull request to send to the mgo driver team.

    Another thought is to make sure the user in your connection has the same privileges in MongoDB on dev and prod.

  • weixin_39886841 4 months ago

    Nope, not using change streams. Does monstache work with SRV connection strings?

  • weixin_39778447 4 months ago

    Nope, it does not work with SRV connection strings: https://rwynn.github.io/monstache-site/advanced/#mongodb-atlas. Are you using Atlas?

  • weixin_39778447 4 months ago

    You might try some of the connection string options that are supported: https://github.com/globalsign/mgo/blob/master/session.go#L236

    connect=direct, maxIdleTimeMS=60000, etc.
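    For example, appended as query parameters to the MongoDB URL in the monstache config (using monstache's mongo-url option; the hostname and database below are just placeholders):

    mongo-url = "mongodb://mongo-host:27017/mydb?connect=direct&maxIdleTimeMS=60000"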

  • weixin_39778447 4 months ago

    If you are feeling adventurous you could compile monstache yourself with the following added:

    
    // enable verbose mgo driver logging; traceLog is the logger monstache already defines
    mgo.SetDebug(true)
    mgo.SetLogger(traceLog)
    
    
    cd monstache
    go install
    # the binary gets placed in $GOPATH/bin
    

    That will output a lot of detail about connections. You can omit SetDebug to see much less output while still getting more information than monstache provides on its own.

  • weixin_39886841 4 months ago

    Does monstache stop reading from Mongo when it starts to get these errors? I realized yesterday that the hosts in these error logs are in a different Atlas zone than the rest of the provided hosts, so cross-region connection limitations would make sense. But I deployed monstache again yesterday, this time with hosts from the same zone, and I got the same errors for the problem hosts again. So even though I'm providing a subset of hosts from the replica set, monstache is able to pull a list of all the hosts in the set and uses that in DialServer, except it never manages to connect to the ones in the different zone and then throws these errors.

    I also realized I never saw these errors in dev because all the Atlas hosts run in the same zone as the app deployment, whereas in prod we have Mongo deployed in multiple regions but monstache in only one. So, are these logs anything to worry about if they only show errors related to the cross-region hosts?

  • weixin_39778447 4 months ago

    Monstache does its best to keep reading and retrying. It relies on the underlying mgo driver, which does a periodic sync stage to get the latest information about servers from MongoDB.

    For example, on the retries: monstache uses gtm to interface with MongoDB through the mgo driver. https://github.com/rwynn/gtm/blob/master/gtm.go From that source code you can see that when it encounters errors it does a waitForConnection, then copies the session to get a new socket, and finally retries what it was doing. It does that both when reading directly from collections and when tailing the oplog.

    Recovery and health are still predicated on the underlying mgo driver. That driver does periodic syncs here, in a loop, every 30s.

    https://github.com/globalsign/mgo/blob/master/cluster.go#L365

    That's where it determines and reacts to the health of mongodb. This sync stage has different behavior if you set connect=direct in the connection string. With direct on, it will only try to connect to the servers in your connection string. By default, it will try to discover servers and use them. You may want to try connect=direct and only specify the mongodb servers in the same zone as monstache.

    So, I would say that since monstache tries to recover from these errors, you should only be worried if you see that they do not get resolved automatically and you see that syncing to Elasticsearch stops completely.

  • weixin_39778447 4 months ago

    Also, you can run monstache in worker mode by setting the config file with:

    
    workers = ['a', 'b']
    

    And then running 2 monstache processes with

    
    monstache -worker a
    monstache -worker b
    

    Each worker will only handle 1/N of the documents, where N is the number of workers. You can have a worker for each zone and configure each connection string with only the hosts in that zone plus connect=direct, as sketched below.
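    A rough sketch of what that could look like with two zones (hostnames, database, and file names are placeholders; I'm assuming monstache's mongo-url option for the connection string and the -f flag for the config file):

    # zone-a.toml -- only the hosts in the same zone as this worker
    mongo-url = "mongodb://zone-a-host-1:27017,zone-a-host-2:27017/mydb?connect=direct"
    workers = ['a', 'b']

    # zone-b.toml
    mongo-url = "mongodb://zone-b-host-1:27017/mydb?connect=direct"
    workers = ['a', 'b']

    Then run one process per config, for example: monstache -f zone-a.toml -worker a and monstache -f zone-b.toml -worker b.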

  • weixin_39778447 4 months ago

    Another thing to try is adding the following to your connection string: readPreference=nearest

  • weixin_39886841 4 months ago

    "So, I would say that since monstache tries to recover from these errors, you should only be worried if you see that they do not get resolved automatically and you see that syncing to Elasticsearch stops completely."

    Once I start seeing connection errors pop up, I don't see any output from the Map function (which produces output on various error conditions). The Mongo cluster monstache is connected to is constantly ingesting new data, so there should always be a stream of data going from Mongo through monstache to ES; yet when these errors start popping up, there's no output from Map and no data sent from monstache to ES (based on the index sizes remaining constant).

  • weixin_39778447 4 months ago

    I'm not sure there is anything that can be done besides swapping out the driver for the new mongodb-go driver or configuring mgo differently, since monstache continuously tries to get new results in a for loop, just logging errors and trying again. If you see that those goroutines are exiting, let me know.

    I am currently looking into the new mongodb go driver, but that still has some issues too.

  • weixin_39778447 4 months ago

    You might try lowering the timeouts or putting them back at their defaults. I know that seems wrong given that the error message says a timeout occurred, but long timeouts might actually hurt the situation if the driver is stuck on a server that it cannot read from for some reason. If the error message is logged but the driver then picks a new server, the default timeout of 7s might be better.
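    For example, a sketch of the config with the overrides dropped so the driver falls back to its defaults (keep ssl and anything else you need):

    [mongo-dial-settings]
    ssl = true
    # read-timeout and write-timeout omitted so the driver uses its default timeouts

    [mongo-session-settings]
    # socket-timeout and sync-timeout omitted as well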

  • weixin_39886841 4 months ago

    Looking at CloudWatch metrics for Elasticsearch, I'm seeing a lot of 4xx responses, which appear to come from monstache bulk-indexing documents already present in ES. I can't verify that for sure because the ES logs don't seem to record anything specific like the request/response, and there's nothing printed in the monstache logs either, so I can't verify it that way, unless errors from bulk writes are suppressed when there are connection issues(?)

  • weixin_39778447 4 months ago

    Do you mean version conflict errors? I don't think those would be suppressed. They should be logged here:

    https://github.com/rwynn/monstache/blob/master/monstache.go#L370

    If they are version conflicts, those are expected, since monstache uses the oplog timestamp as the version number and sends documents on multiple concurrent connections. It uses Elasticsearch to check and tell it whether the document is already there or is actually newer than the one in the request.
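    Illustratively, external versioning behaves roughly like this (placeholder index and id, not monstache's exact requests):

    PUT /some-index/_doc/42?version=6645362998024405123&version_type=external
    -> 2xx, document stored with that version
    PUT /some-index/_doc/42?version=6645362998024405000&version_type=external
    -> 409 version_conflict_engine_exception, the older version is rejected, which is the expected "conflict"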

  • weixin_39778447 4 months ago

    Certain types of status codes are considered retryable and will be sent again in the hope that they may succeed.

    https://github.com/olivere/elastic/blob/release-branch.v6/bulk_processor.go#L21

    
    defaultRetryItemStatusCodes = []int{408, 429, 503, 507}
    
  • weixin_39778447 4 months ago

    Definitely keep an eye on any errors in Elasticsearch. If they are in the set above, monstache will not drop them but will keep them in the queue, and since this creates back pressure on the ingestion from MongoDB, new events from MongoDB will not be processed.

  • weixin_39886841 4 months ago

    Are the new events in this case discarded or kept around? I'm going to test changing the connection type to mongo and see if that makes a difference. It only takes about 12 or so hours for the timeouts to start popping up, which is interesting because everything works fine at deployment time. Cross-region communication seems to work for a while before randomly breaking hours into the deployment.

  • weixin_39778447 4 months ago

    The new events are not discarded. The whole thing is tied together via Go channels. Without buffering, you cannot write to a channel unless another goroutine is reading from it. To simplify, you can think of it as 3 goroutines communicating over two channels.

    There is one goroutine reading from MongoDB and writing to a channel A. Another goroutine reads from channel A, processes, and writes to another channel B. The final goroutine reads from channel B and writes to Elasticsearch.

    If the final goroutine is stuck because it is retrying the requests in its queue, then it is busy and cannot read from B. If B is not being read from, the middle goroutine cannot write to it, so it waits. If the middle goroutine is waiting to write to B, it is not reading from A, and if nothing is reading from A, the MongoDB goroutine is stuck waiting for a reader: it is in the middle of a query, unable to write results or proceed to fetch the next batch.

    The middle goroutine actually saves each timestamp it receives from MongoDB when resume=true, so you can bring down the whole pipeline and resume where you left off.
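    A minimal, self-contained sketch of that blocking behavior (not monstache's actual code; the sleep just stands in for slow or retrying Elasticsearch bulk requests):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        a := make(chan int) // "reader" -> "processor", unbuffered
        b := make(chan int) // "processor" -> "indexer", unbuffered

        // goroutine 1: reads from "MongoDB" and writes to channel A
        go func() {
            for i := 0; ; i++ {
                a <- i // blocks until the processor receives
            }
        }()

        // goroutine 2: reads from A, processes, writes to B
        go func() {
            for op := range a {
                b <- op // blocks until the indexer receives
            }
        }()

        // goroutine 3: "indexes into Elasticsearch", but is slow/retrying
        for op := range b {
            fmt.Println("indexing", op)
            // while this sleeps nothing drains B, so the processor blocks on B,
            // stops reading A, and the "MongoDB reader" blocks on A as well
            time.Sleep(5 * time.Second)
        }
    }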

  • weixin_39886841 4 months ago

    Hey Ryan,

    I set debugging to true for mgo and saw a lot of log lines where there would be no response from the cursor, or where the topology was being synced very frequently. The ingestion speeds are abysmal, and I'm running this on a MacBook Pro with plenty of RAM (I also turned off the extra logging in the second run). Is there something that limits how much data is read from Mongo? I'm trying to understand what's blocking the ingestion speed here but I can't wrap my head around it.

    I've had the connector running for about 8 minutes and only 170 documents have been synced.

    Seeing a lot of these as well:

    ERROR 2019/02/15 19:35:22 Error reading segment of collection x. Will retry segment.: read tcp :63781->:27017: i/o timeout

  • weixin_39778447 4 months ago

    One thing you can try is to build monstache with the indexing part omitted, just to get an idea of whether the indexing is backing up the reads.

    To do that you would just need to modify the monstache.go file at the very end to comment out the call to routeOp and instead just print the op.

    This removes any back pressure on the reads that the write side may be introducing. If you make this change and then notice that it prints docs at a much faster rate, then you know that indexing into Elasticsearch must be adding back pressure to the reads, causing them to time out. At that point you would need to look into why indexing is not going smoothly. That is one thing I can think of that would slow down reads.

    If you still get the same slow output of documents, then it really is some problem getting data out of MongoDB.

  • weixin_39778447 4 months ago

    You might also try adding the following to your connection string.

    readPreference=primaryPreferred

    Or

    readPreference=nearest

    I think the default is that all reads go through the primary which might be cross AZ in your case.

    Did you see any error messages about unable to determine primary during the sync stage in the debug output?

  • weixin_39886841 4 months ago

    You might also try adding the following to your connection string.

    readPreference=primaryPreferred

    Or

    readPreference=nearest

    I think the default is that all reads go through the primary which might be cross AZ in your case.

    Did you see any error messages about unable to determine primary during the sync stage in the debug output?

    Nope, every sync finds the primary and is able to determine the cluster topology.

    Is ingestion with a direct connection slower than a replica set connection? When I toggle the connection string between connect=direct and replicaSet="", the rate at which ops go through the map function fluctuates heavily, dropping with a direct connection and picking up with a replica set connection. Is there an expected difference between the different connection settings?

    BTW, I tested without the routeOp function and indexing seems to be the culprit in the slow ingestion speeds. I need to do some further testing, but what elasticsearch instance did you use in your testing?

  • weixin_39778447 4 months ago

    I usually test with Elasticsearch 6.0 inside a VM. So the network latency is not a factor in this case.

    You might want to check your Elasticsearch cluster for correct sizing and configuration for lots of bulk requests coming from Monstache.

  • weixin_39560002 4 months ago

    Hi... anything that is inserted into MongoDB is not being reflected in Elasticsearch.

    I am looking for a full sync plus continuous updates to Elasticsearch...

    I do not want to restart Monstache with any configuration changes.

  • weixin_39778447 4 months ago

    Filtering should not be expensive. It would be difficult for me to say why there is a discrepancy without knowing more about the differences between dev and production.

  • weixin_39778447 4 months ago

    It seems like this error message would be coming from this line. This is used when SSL is enabled.

    I've not tested with timeout settings like the ones you provided. Did you override those because you were having problems with timeouts? Also, what version of monstache are you running?

  • weixin_39778447 4 months ago

    Did you figure out what might be the problem? I think even your description of performance in the dev setting seems a bit low based on my own testing. Granted, my testing was in a VM with MongoDB and Elasticsearch local to monstache, but I have seen millions of documents indexed within minutes, not hours. This is with the default settings: 4 concurrent Elasticsearch clients, 8MB bulk size, and a 5-second auto flush. I guess with large documents or middleware involved things could slow down a bit, but I would still expect the process to be on the order of minutes.
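    For reference, those defaults correspond roughly to the following monstache settings (I believe the option names are elasticsearch-max-conns, elasticsearch-max-bytes, and elasticsearch-max-seconds, but treat this as a sketch and check the monstache docs):

    elasticsearch-max-conns = 4        # concurrent bulk indexing clients
    elasticsearch-max-bytes = 8388608  # ~8MB per bulk request before flushing
    elasticsearch-max-seconds = 5      # auto flush interval in seconds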

  • weixin_39886841 4 months ago

    It seems like this error message would be coming from this line. This is used when SSL is enabled.

    I've not tested with timeout settings like the ones you provided. Did you override those because you were having problems with timeouts? Also, what version of monstache are you running?

    Hey Ryan,

    Yeah, I was seeing similar errors before, so I increased the timeouts and those errors went away, but now these have started showing up. I deployed monstache in our production environment, and looking at the logs, messages like this have been printing out for the past 9 hours for only 2 of the provided hosts. I'm running the latest version of monstache too.

    As for the slow ingestion speeds, I'm not sure what the cause is other than possibly IOPS on the MongoDB cluster in dev. The plugin isn't doing anything complicated or making any network calls, so I doubt it's adding any significant delay, but other than that I'm not really sure. I've been using this for the gtm settings in both dev and prod: [gtm-settings] channel-size = 1024. Are there any logs I can provide to help out?

  • weixin_39886841 4 months ago

    Restarted the container and saw this pop up in the logs:

    ERROR 2019/01/24 19:39:21 Unable to dial MongoDB: EOF
    ERROR 2019/01/24 19:39:21 Unable to dial MongoDB: EOF

  • weixin_39778447 4 months ago

    Yes, that happens when you terminate mongod while monstache is running and then start mongod again. I am putting some work into recovery for the next release.

  • weixin_39886841 4 months ago

    Had the container running the entire time since I restarted it this morning and now it's showing the same log lines:

    ERROR 2019/01/25 00:27:28 Unable to dial MongoDB: dial tcp [ip1]:27017: connect: connection timed out
    ERROR 2019/01/25 00:29:27 Unable to dial MongoDB: dial tcp [ip2]:27017: connect: connection timed out
    ERROR 2019/01/25 00:29:27 Unable to dial MongoDB: dial tcp [ip1]:27017: connect: connection timed out
    ERROR 2019/01/25 00:29:44 Unable to dial MongoDB: dial tcp [ip2]:27017: connect: connection timed out
    ERROR 2019/01/25 00:29:44 Unable to dial MongoDB: dial tcp [ip1]:27017: connect: connection timed out

    As soon as I start seeing these logs, no data is getting sent to ES so does that mean there is nothing else left to read from Mongo?

  • weixin_39778447 4 months ago

    I don't think it means there is nothing else to read, but rather that the TLS connection failed. Since this message comes from the mgo golang driver, you may want to check there.

    The code monstache uses when the connection is SSL enabled is

    dialInfo.DialServer = func(addr *mgo.ServerAddr) (net.Conn, error) {
        conn, err := tls.Dial("tcp", addr.String(), tlsConfig)
        if err != nil {
            errorLog.Printf("Unable to dial MongoDB: %s", err)
        }
        return conn, err
    }

    Where tlsConfig depends on if you've configured monstache with your own certificates.

    The mgo driver calls that function at connection time. Do you also get this with SSL disabled?

  • weixin_39886841 4 months ago

    The same code is deployed in dev and I have yet to see even a single log line like this there; what's more, these lines appear hours after the process is started. I don't provide my own certs or PEM files either, so monstache is using an empty tlsConfig object in that Dial call the entire time. Why would there be a delay if the tlsConfig were somehow invalid?

    I haven't tried running with SSL off because of the disparity between dev and prod, which both connect to similar clusters in MongoDB Atlas.
