Operational Defect Database

BugZero found this defect 2475 days ago.

MongoDB | 383556

[SERVER-29206] MongoS [NetworkInterfaceASIO-TaskExecutorPool-0-0] ExceededTimeLimit: Operation timed out

Last update date:

1/29/2018

Affected products:

MongoDB Server

Affected releases:

3.2.12

Fixed releases:

No fixed releases provided.

Description:


We continue to receive the errors below in our mongos logs, which causes connection issues to our mongo cluster. This happens throughout the day, every day. We have 4 mongos nodes and are about to go to 8 to see if that helps. All 4 see the issue.

mongos.log:
2017-05-15T14:30:16.501+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r4-data-001.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T14:30:16.501+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r11-data-001.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T14:30:16.501+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r9-data-003.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T14:30:16.501+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r9-data-003.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T14:30:16.534+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r6-data-003.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T14:30:16.534+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r6-data-003.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T14:30:16.534+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r8-data-003.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T14:30:18.442+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r8-data-003.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out

Below is our configuration file:

---
net:
  port: '27017'
  bindIp: 0.0.0.0
systemLog:
  path: "/var/log/mongodb/mongos.log"
  logAppend: true
  destination: file
sharding:
  autoSplit: true
  configDB: c0-config-001.mongo-euc-1.com:27017,c0-config-002.mongo-euc-1.com:27017,c0-config-003.mongo-euc-1.com:27017
  chunkSize: 64
setParameter:
  cursorTimeoutMillis: '1800000'
security:
  keyFile: "*********"

We have tried a variety of options; none have helped:

setParameter:
  cursorTimeoutMillis: '1800000'
  ShardingTaskExecutorPoolHostTimeoutMS: '300000000'
  ShardingTaskExecutorPoolMinSize: '50'

setParameter:
  cursorTimeoutMillis: '1800000'
  ShardingTaskExecutorPoolRefreshTimeoutMS: 60000
  ShardingTaskExecutorPoolHostTimeoutMS: 300000000
  ShardingTaskExecutorPoolMinSize: 125

With any of the config options above, we see the errors occur pretty quickly.

"opcounters" : {
  "insert" : 5369,
  "query" : 431092,
  "update" : 283293,
  "delete" : 5692,
  "getmore" : 88,
  "command" : 760326
},
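For reference, the "opcounters" block quoted above is taken from the serverStatus output of the mongos. The sketch below, run from a mongo shell connected to the affected mongos, shows one way to sample those counters and to inspect the outbound mongos-to-mongod connection pools named in the NetworkInterfaceASIO-TaskExecutorPool log lines. The commands are standard MongoDB commands, but the exact fields they report vary by version, and this is only a diagnostic suggestion, not part of the original ticket:

// Run from a mongo shell connected to the affected mongos.
// serverStatus() is the source of the "opcounters" figures quoted above.
db.serverStatus().opcounters

// connPoolStats summarizes the outbound mongos -> mongod connection pools,
// including per-host counts of available vs. in-use connections, which helps
// show whether a connection storm coincides with the ExceededTimeLimit errors.
db.adminCommand({ connPoolStats: 1 })

Comparing these pool counts across all four mongos nodes while the errors are occurring would show whether one router or all of them are opening connections at once.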

Top User Comments

jason.carey commented on Wed, 21 Jun 2017 20:37:16 +0000:
Now that we've picked a release for the backport of SERVER-29237 (3.2.15, not yet scheduled, but the next one up), I'm marking this ticket as a duplicate of that and closing this out. If you have any questions, feel free to re-open or open a new ticket as needed.

jason.carey commented on Tue, 16 May 2017 21:08:51 +0000:
You may need to go down a bit further if 200 for max pool size is too large. Remember that it's effectively N cores * pool size * num_mongos.

In a separate avenue, I'm exploring a rate limiter on new connection creation in SERVER-29237, which I'm attempting to keep minimal so that a 3.2 backport is on the table. It will provide a tunable that limits the number of parallel connection attempts from any connection pool to any host. If your problem is always about the initial connection storm, and the load would otherwise be sustainable (after connections are established), it may do the trick.

wegerer commented on Tue, 16 May 2017 00:03:58 +0000:
I just tried this:

ShardingTaskExecutorPoolHostTimeoutMS: '300000000'
ShardingTaskExecutorPoolMinSize: '100'
ShardingTaskExecutorPoolMaxSize: '200'

and

ShardingTaskExecutorPoolHostTimeoutMS: '300000000'
ShardingTaskExecutorPoolMinSize: '100'
ShardingTaskExecutorPoolMaxSize: '3200'

Still seeing the issue. I even waited over 5 minutes after restarting the mongos process to see if the errors would clear, but they did not. Saw these errors:

2017-05-15T21:51:07.742+0000 I NETWORK [NetworkInterfaceASIO-TaskExecutorPool-2-0] Marking host c0-r8-data-003.mongo-euc-1.com:27017 as failed :: caused by :: ExceededTimeLimit: Couldn't get a connection within the time limit
2017-05-15T21:51:07.743+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r5-data-001.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T21:51:07.743+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to c0-r5-data-001.mongo-euc-1.com:27017
2017-05-15T21:51:07.743+0000 I NETWORK [NetworkInterfaceASIO-TaskExecutorPool-2-0] Marking host c0-r2-data-002.mongo-euc-1.com:27017 as failed :: caused by :: ExceededTimeLimit: Couldn't get a connection within the time limit
2017-05-15T21:51:07.743+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r4-data-004.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T21:51:07.743+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r2-data-003.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T21:51:07.743+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to c0-r2-data-003.mongo-euc-1.com:27017
2017-05-15T21:51:07.743+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to c0-r2-data-003.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T21:56:01.473+0000 I NETWORK [NetworkInterfaceASIO-TaskExecutorPool-3-0] Marking host c0-r5-data-002.mongo-euc-1.com:27017 as failed :: caused by :: ExceededTimeLimit: Couldn't get a connection within the time limit
2017-05-15T21:56:01.534+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to c0-r0-data-003.mongo-euc-1.com:27017 - ExceededTimeLimit: Operation timed out
2017-05-15T21:56:02.818+0000 I NETWORK [mongosMain] connection accepted from 172.26.41.204:5603 #14083 (1177 connections now open)

I was able to get this config in and it has not reported the errors so far, though I will continue to monitor:

setParameter:
  cursorTimeoutMillis: '1800000'
  ShardingTaskExecutorPoolHostTimeoutMS: '300000000'
  ShardingTaskExecutorPoolMinSize: '50'
  ShardingTaskExecutorPoolRefreshTimeoutMS: '600000'

jason.carey commented on Mon, 15 May 2017 20:58:05 +0000:
jonhyman and wegerer, I'm summarizing my thoughts at the top and responding more directly to your comments below.

Summary:
- The major difference between the 3.0 and 3.2 mongos is that we can and do open connections to mongod much more concurrently and much faster. This can lead to connection storms that swamp mongod, reducing performance and eventually leading to timeouts as new connections are created (because the old ones aren't coming free fast enough).
- This behavior is somewhat hidden in versions < 3.2.8 because we never removed old connections. This often led to a surplus of completely idle connections (which are not expensive on mongod).
- The only current workaround is to set max pool size, which will prevent mongos from opening too many connections at once.
- Short-term development on our side will explore rate limiting new connection creation.

Responding to your comments in order:

Setting max pool size

"We did not set a max pool size, I am fearful of limiting connections. We have thousands of processes which connect to our mongoS; in total we run tens of thousands of application server threads which may talk to the database, and I am fearful of limiting connections to something as small as 200*num_mongos*num_cores_per_mongos."

Unfortunately, that's exactly the setting you need in order to work around this problem. First, let me try to explain how setting the maximum size of the connection pool is likely to influence the running characteristics of your cluster, and where it is and is not dangerous.

Suppositions. For all of these scenarios, presume:
- N total active client connections that would like to talk to a particular shard, where N > the number of requests that shard can concurrently handle (perhaps 10k before performance degrades)
- M shards

Scenarios

No max pool size set:
- Responsive db + fast operations: Assuming the traffic grows gradually, each client may open hundreds of connections to a particular mongos. We see some evidence of this in the > 6k ingress connections noted in the logs. Assuming requests can be handled quickly, mongos -> mongod connections will be aggressively re-used; let's say < 10 connections per mongos to mongod. Each mongod will attempt to run < 1000 total concurrent operations, which will keep throughput up.
- Responsive db + fast operations + small number of slow ops: Same as above, but an extra few active threads on the mongod for those long-running operations.
- Responsive db + large number of slow ops: Over time, many connections will open to the shard, which will consume ever larger server resources. If the server is still able to satisfy requests in time, it will merely be slow. If it gets behind, we'll eventually see an unresponsive db.
- Temporarily unresponsive db: This might occur after a primary step-down, as all requests pause for a moment and are then attempted en masse against the new primary. Each mongos will open N/M connections to that mongod, which will prevent it from ever satisfying requests quickly and will lead to timeouts.

Max pool size set:
- Responsive db + fast operations: Identical to without the max pool size set; if operations are fast, we'll never hit the pool cap.
- Responsive db + fast operations + some slow ops: Identical to without the max pool size set; if operations are fast, we'll never hit the pool cap, and if max pool > concurrent slow ops, we'll also avoid hitting the cap.
- Responsive db + many slow ops: Slow ops may crowd out fast ops, leading to poor p95/p99 latency.
- Temporarily unresponsive db: Requests will queue in mongos and will not burst down onto the server after it becomes available again. Requests will be served in soonest-to-expire order. If max pool size > long-running ops, this is likely to be unnoticeable.

Conclusion: Setting max pool size is unlikely to worsen throughput, as actually using more than 3-4k or so threads on a mongod concurrently rarely scales. Things like p99 and p95 latency can theoretically worsen, but only compared to a working system, and only if max pool size is less than the desired number of long-running operations (so that long-running ops crowd out short ones).

Further server development

"In thinking about this ticket and our problems, are you continuing to do more investigation based on these settings and what works, to make changes to how the router works, or is your current approach that we should continue to tweak these settings as needed? The problem as I see it with the latter is that when the ExceededTimeLimit issue occurs, we have downtime of some kind due to read/write failures to the mongoS. It's a really bad situation when that happens."

I asked about your configuration under these various settings mostly to understand whether it was possible that you were dropping connections due to periods of idleness. The efficacy of setting the host timeout very high seems to imply that you occasionally see no traffic at all to some hosts, followed by traffic spikes. In terms of long-term mongos development, it most strongly indicates that we need back pressure from mongod to prevent mongos from opening so many connections at once. It may be worth noting that the only form this backpressure can take is forcing requests to queue on the mongos (much like setting the max pool size). I don't think that more tweaking of this configuration is likely to help you without looking at setting max pool size.

Scalability of configuration-driven solutions

"The instances that we're running 3.2.12 on right now have lower load by more than an order of magnitude than our busiest mongo instances, so I am extremely concerned that even if we get these settings right on one instance as Antonios has – we now have a 3.2.12 instance that we haven't seen the problem on, but it's about 4-10x smaller/less busy than the one Wayne is reporting problems on, which is at least 10x smaller/less busy than our busiest instances – we're going to be relying on fragile configuration that may not support more application server growth or throughput."

I suspect that the stable configuration Antonios has found is simply one that always keeps around enough connections that it rarely, if ever, needs to make new ones. In particular, without setting the max pool size, you have enough application pressure to swamp your shards. While this is an issue that we can and should work around in software, by slowing the rate of new connection creation in mongos, note that we'll still fail in periods of system unresponsiveness (we'll wait longer, but we'll still create too many concurrent operations on a mongod). Long-term solutions will have to explore backpressure and queueing outside of mongod.
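The workaround jason.carey keeps returning to is capping the outbound pools with ShardingTaskExecutorPoolMaxSize; the ticket itself sets the related parameters in the setParameter: block of the mongos configuration file, as shown in the description above. The shell sketch below is illustrative only: the parameter names are real mongos parameters, but the arithmetic simply restates the "cores * pool size * num_mongos" rule of thumb from the comments with a hypothetical core count.

// Confirm the effective pool settings on a running mongos
// (getParameter: "*" dumps every parameter if you want the full list).
db.adminCommand({ getParameter: 1, ShardingTaskExecutorPoolMaxSize: 1 })
db.adminCommand({ getParameter: 1, ShardingTaskExecutorPoolMinSize: 1 })

// Rough upper bound on router connections a single shard host can receive,
// per the "cores * pool size * num_mongos" rule of thumb (8 cores is assumed):
//   8 cores per mongos * 200 max pool size * 4 mongos nodes
//     = 6400 potential connections to one shard host

If that product is far beyond what a shard can service concurrently (the comments suggest roughly 3-4k threads as a practical ceiling), the cap is what keeps connection storms off the mongod.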


Status

Closed
