Operational Defect Database

BugZero found this defect 2494 days ago.

MongoDB | 377683

[SERVER-28974] Mongos leak connections to mongods

Last update date:

10/27/2023

Affected products:

MongoDB Server

Affected releases:

3.2.12

Fixed releases:

No fixed releases provided.

Description:

Info

We ran into an issue where mongos was suddenly creating a lot of connections to mongods and ran out of file descriptors. This eventually consumed all descriptors also on mongods which then crashed (https://jira.mongodb.org/browse/SERVER-28933). This happened when there was some intermittent network congestion. We saw errors like this when the leaks occurred: 2017-04-24T21:48:41.865+0000 I NETWORK [ReplicaSetMonitorWatcher] Socket recv() timeout 10.123.0.1:27017 2017-04-24T21:48:41.865+0000 I NETWORK [ReplicaSetMonitorWatcher] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.123.0.1:27017] 2017-04-24T21:48:41.865+0000 I NETWORK [ReplicaSetMonitorWatcher] Detected bad connection created at 1493026624910579 microSec, clearing pool for mongo5:27017 of 0 connections "0 connections" there is particularly interesting. Eventually the leaking lead to running out of descriptors: 2017-04-24T21:48:42.487+0000 I NETWORK [mongosMain] Listener: accept() returns -1 errno:24 Too many open files 2017-04-24T21:48:42.487+0000 E NETWORK [mongosMain] Out of file descriptors. Waiting one second before trying to accept more connections. The logs indicated that there would be about 500 open client connections, but when inspecting with lsof, it revealed that there were less than 10. It also revealed that there were 6 connections to the mongocs, 23 connections to the primary and total of 474 connections to the two secondaries. It also showed 478 "sock" descriptors with "can't identify protocol": mongos 5495 mongodb 1023u sock 0,7 0t0 84718243 can't identify protocol Normally there isn't any of those and there's just a few connections to the secondaries as well. This is currently a bit special setup with a single shard (replicaset of 3 nodes) and about 200 mongoses that connect to it. The leaking occurred on all of the mongos instances (all were also affected by the network congestion issue). This is what connPoolStatus showed around that time (although, I've understood that it's not entirely reliable): MongoDB shell version: 3.2.12 connecting to: test mongos> db.runCommand( { "connPoolStats" : 1 } ) { "numClientConnections" : 6, "numAScopedConnections" : 0, "totalInUse" : 478, "totalAvailable" : 24, "totalCreated" : 5000, "hosts" : { "mongocdata3:27019" : { "inUse" : 0, "available" : 1, "created" : 1 }, "mongocdata2:27019" : { "inUse" : 0, "available" : 1, "created" : 1 }, "mongocdata1:27019" : { "inUse" : 0, "available" : 1, "created" : 1 }, "mongo7:27017" : { "inUse" : 0, "available" : 1, "created" : 7 }, "mongo6:27017" : { "inUse" : 0, "available" : 1, "created" : 4 }, "mongo5:27017" : { "inUse" : 0, "available" : 1, "created" : 18 } }, "replicaSets" : { "datacfg" : { "hosts" : [ { "addr" : "mongocdata1:27019", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 0 }, { "addr" : "mongocdata2:27019", "ok" : true, "ismaster" : true, "hidden" : false, "secondary" : false, "pingTimeMillis" : 0 }, { "addr" : "mongocdata3:27019", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 0 } ] }, "rs1" : { "hosts" : [ { "addr" : "mongo5:27017", "ok" : true, "ismaster" : true, "hidden" : false, "secondary" : false, "pingTimeMillis" : 0 }, { "addr" : "mongo6:27017", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 9 }, { "addr" : "mongo7:27017", "ok" : true, "ismaster" : false, "hidden" : false, "secondary" : true, "pingTimeMillis" : 4 } ] } }, "ok" : 1 }

Top User Comments

thomas.schubert commented on Fri, 9 Jun 2017 21:24:39 +0000: Hi devastor, Thank you for the update. Since this is no longer an issue for you, I'm resolving this ticket. Please let us know if you encounter similar behavior in the future, and we will continue to investigate. Kind regards, Thomas devastor commented on Tue, 30 May 2017 09:49:37 +0000: Hi Thomas, Unfortunately, we have not seen it after upgrading. Also our setup changed a bit so we don't have that many mongoses any more, which might affect it, too. thomas.schubert commented on Wed, 24 May 2017 08:41:58 +0000: Hi devastor, Have you encountered this issue again since upgrading? Thanks, Thomas devastor commented on Wed, 3 May 2017 15:47:40 +0000: Thanks Thomas, Just to update, we are now running 3.2.13 and periodically logging the output of connPoolStats, so let's see what happens. thomas.schubert commented on Thu, 27 Apr 2017 18:44:24 +0000: Hi devastor, Thank you for reporting this behavior. We've completed a large amount of work in MongoDB 3.2.13 to improve diagnostics to investigate issues such as this. The release candidate, MongoDB 3.2.13-rc0, is available now, and general availability release is scheduled for early next week. In particular MongoDB 3.2.13 includes: Significant improvements to connPoolStats (including SERVER-25218) Log time to establish egress connections through NetworkInterfaceASIO (SERVER-28259) Add logging to track egress connection pool size (SERVER-28652) Add egress logging to legacy write path (SERVER-28760) Would you please upgrade and periodically collect the output of connPoolStats? Please feel free to increase the delay (in seconds) to a larger number that would still show the issue emerge according to the timelines you've observed previously. delay=1 mongo --eval "while(true) {print(JSON.stringify(db.serverStatus().localTime));print(JSON.stringify(db.runCommand( { "connPoolStats" : 1 } ))); sleep(1000*${delay:?})}" >connPoolStats.log & If you encounter this issue again, would you please upload the following diagnostic information: The connPoolStats.log file Complete log files of an affected mongos Complete log files of an affected mongod An archive of the diagnostic.data directory of the affected mongod I've created a secure upload portal for you to provide these files. Files uploaded to this portal are only visible to MongoDB employees and are routinely deleted after some time. Thank you for your help, Thomas

Additional Resources / Links

Share:

BugZero Risk Score

Coming soon

Status

Closed

Have you been affected by this bug?

cost-cta-background

Do you know how much operational outages are costing you?

Understand the cost to your business and how BugZero can help you reduce those costs.

Discussion

Login to read and write comments.

Have you ever...

had your data corrupted from a

VMware

bug?

Search:

...