Operational Defect Database

BugZero found this defect 2631 days ago.

MongoDB | 338388

[SERVER-27393] Balancer taking 100% CPU due to large number of dropped sharded collections

Last update date:

4/5/2017

Affected products:

MongoDB Server

Affected releases:

3.4.0

Fixed releases:

3.4.2

3.5.2

Description:

Info

The balancer started to take 100% CPU after upgrading from 3.2.9 to 3.4.0 and enabling the balancer. This is a database with 4 shards (rs1, rs2, rs3 and rs4); before upgrading we removed one shard (rs5) and waited until the drain completed. In the log I can see warnings like these for several collections:

2016-12-13T09:52:11.211+0000 W SHARDING [Balancer] Unable to enforce tag range policy for collection eplus.wifiCollection_20161008 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161008
2016-12-13T09:52:13.087+0000 W SHARDING [Balancer] Unable to enforce tag range policy for collection eplus.wifiCollection_20161009 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161009
2016-12-13T09:53:38.583+0000 W SHARDING [Balancer] Unable to balance collection eplus.wifiCollection_20161008 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161008
2016-12-13T09:53:40.360+0000 W SHARDING [Balancer] Unable to balance collection eplus.wifiCollection_20161009 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161009

These collections are created and then dropped after a few days, and indeed they were dropped and no longer appear in "db.getCollectionNames()". I investigated a bit and found them still listed in the config database:

{ "_id" : "eplus.wifiCollection_20161008", "lastmodEpoch" : ObjectId("000000000000000000000000"), "lastmod" : ISODate("2016-10-18T04:00:13.108Z"), "dropped" : true }
{ "_id" : "eplus.wifiCollection_20161009", "lastmodEpoch" : ObjectId("000000000000000000000000"), "lastmod" : ISODate("2016-10-19T04:00:48.158Z"), "dropped" : true }

There are also locks for many collections related to the removed shard (rs5):

{ "_id" : "eplus.wifiCollection_20160908", "state" : 0, "ts" : ObjectId("5837ee01c839440f1e70d384"), "who" : "wifi-db-05a:27018:1475838481:-1701389523:conn104", "process" : "wifi-db-05a:27018:1475838481:-1701389523", "when" : ISODate("2016-11-25T07:53:37.235Z"), "why" : "migrating chunk [{ lineId: 8915926302292949940 }, { lineId: MaxKey }) in eplus.wifiCollection_20160908" }
{ "_id" : "eplus.wifiCollection_20160909", "state" : 0, "ts" : ObjectId("5837ee01c839440f1e70d38b"), "who" : "wifi-db-05a:27018:1475838481:-1701389523:conn104", "process" : "wifi-db-05a:27018:1475838481:-1701389523", "when" : ISODate("2016-11-25T07:53:37.296Z"), "why" : "migrating chunk [{ lineId: 8915926302292949940 }, { lineId: MaxKey }) in eplus.wifiCollection_20160909" }

Locks exist not only for dropped collections but also for existing ones. Our guess is that this causes the balancer to continuously loop over all collections, driving CPU to 100%, but we are not sure how to work around it.
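For reference, the reporter's investigation can be reproduced from a mongos shell connected to the cluster. This is only a sketch mirroring the queries described above against the config database:

mongos> use config
mongos> db.collections.count({ dropped: true })                          // dropped entries the balancer still iterates over
mongos> db.collections.find({ dropped: true }, { _id: 1, lastmod: 1 })   // list the dropped namespaces
mongos> db.locks.find({ state: 0 }).limit(5)                             // leftover lock documents (state 0 = unlocked)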

Top User Comments

xgen-internal-githook commented on Thu, 22 Dec 2016 17:57:05 +0000:
Author: Nathan Myers (nathan-myers-mongo) <nathan.myers@10gen.com>
Message: SERVER-27393 Balancer should skip dropped collections
Branch: v3.4
https://github.com/mongodb/mongo/commit/81e1e3cc42a592cdb9e3ca32a43eec2a3790a18f

xgen-internal-githook commented on Wed, 21 Dec 2016 22:47:50 +0000:
Author: Nathan Myers (nathan-myers-mongo) <nathan.myers@10gen.com>
Message: SERVER-27393 Balancer should skip dropped collections
Branch: master
https://github.com/mongodb/mongo/commit/ba15d5ba13359ba0aedbb58ee10cfc919fd9e0f6

icruz commented on Wed, 14 Dec 2016 09:17:54 +0000:
Just a note: with just one "dropped" collection in config.collections, the balancer keeps retrying with a pause of 10 seconds. As soon as I removed all entries with dropped=true, there were no more Balancer logs on the config server and CPU usage dropped to 1%.

kaloian.manassiev commented on Tue, 13 Dec 2016 17:59:17 +0000:
Summary of the problem: the sharding balancer uses the entries in config.collections to decide which collections to balance. This is done without any predicate on the state of the collection, which means we try to balance dropped collections. For customers who frequently drop and recreate collections, this causes load on the server because a chunks query must be issued for each collection. We should make the balancer skip dropped namespaces.

kaloian.manassiev commented on Tue, 13 Dec 2016 17:58:02 +0000:
Thanks icruz. Please watch this ticket for updates on when the fix becomes available. Best regards, -Kal.

icruz commented on Tue, 13 Dec 2016 16:00:05 +0000:
Hi Kal, thanks for your help; after removing the dropped collections, CPU load went back to normal. There were only 47 dropped collections, but the config server runs on an Amazon t2.micro, so it has very little CPU available. I'm attaching the log from the config server, only for 1 hour since it essentially repeats.

kaloian.manassiev commented on Tue, 13 Dec 2016 15:45:57 +0000:
Hi icruz, thank you for the report and the investigation. Just as you noticed, it looks like the balancer code is not ignoring deleted collections (those with dropped:true), and when we try to treat them as sharded we get an error because there are no chunks for them. This error is benign, but since you most likely have a lot of these dropped collections, it is causing load on your server because of the number of queries the config server has to make. We are going to fix this by skipping dropped collections. In the meantime, you can work around this problem by manually removing all collection entries with dropped:true using this command:

mongos> use config
mongos> db.collections.remove({dropped:true})
WriteResult({ "nRemoved" : 1000 })

Do you mind letting us know how many dropped collections you have, just so we can write a realistic test for this problem (I created a database with 1000 dropped collections and was unable to get such high CPU utilization):

mongos> use config
mongos> db.collections.count({dropped:true})
1000

About your question on the locks: these locks have state:0, which means "unlocked". These records never get cleaned up, and that is why you are seeing them. Finally, would it be possible to attach the log from the config server primary so we can have a look at all the errors you are getting? Thank you in advance. Best regards, -Kal.
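For convenience, the workaround from the comments above can also be run as a single mongo shell script against a mongos for the affected cluster. This is a sketch only, not an official procedure; the fix in 3.4.2 / 3.5.2 makes the balancer skip these entries, so the cleanup matters only on affected releases.

// Scripted variant of the workaround: count and then remove config.collections
// entries marked dropped:true. This only removes metadata for collections that
// are already dropped, but back up the config database first if unsure.
var cfg = db.getSiblingDB("config");
var dropped = cfg.collections.count({ dropped: true });
print("dropped collection entries: " + dropped);
if (dropped > 0) {
    printjson(cfg.collections.remove({ dropped: true }));  // e.g. WriteResult({ "nRemoved" : 47 })
}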


Status

Closed
