Operational Defect Database

BugZero found this defect 2391 days ago.

MongoDB | 415121

[SERVER-30600] mongos does not detect stale config when clients use non-primary read preferences

Last update date:

9/7/2017

Affected products:

MongoDB Server

Affected releases:

3.4.5

Fixed releases:

No fixed releases provided.

Description:

Info

Mongos instances which do not receive any requests with the primary read preference do not get their chunk location configuration updated after a chunk migration. This results in missing data in query results when the query includes the shard key and the mongos routes the query to the wrong shard. The only workaround I have come up with so far is to hit every mongos instance with a dummy primary-read-preference query for each sharded collection (or perhaps call the refresh command against the mongos) at some regular interval.

Background info: I run a single 5-node replica set which spans 3 data centers: 3 nodes in the central "primary" DC and 1 node in each of our regional "secondary" DCs. My application is read-only, runs in all 3 DCs, has high read-performance requirements, and a high tolerance for eventual consistency. As a result, I run with the "nearest" read preference so that my app running in a regional DC will prefer to read from the MongoDB secondary running in the same DC, rather than going all the way back to the primary in the central DC.

We've hit VM RAM capacity issues and are now attempting to shard in place into 3 shards, with a mongos instance co-located with each app instance. Everything went smoothly at first; I allowed the balancer to migrate some chunks to the new shards. After a few chunks I disabled the balancer to verify there were no production errors, and found that objects which had moved were no longer coming back in queries by shard key.
If I make an identical query against the mongos from the shell (which defaults to the primary read preference), I see the following in the logs and get correct results:

2017-08-10T17:30:45.750+0000 D QUERY [conn87] Received error status for query query: { guid: "some_guid" } sort: {} projection: {} on attempt 1 of 10: SendStaleConfig: [MyDb.myCollection] shard version not ok: version mismatch detected for MyDb.myCollection ( ns : MyDb.myCollection, received : 118|0||598b5cf1b6ff8d56d195d96f, wanted : 121|1||598b5cf1b6ff8d56d195d96f, send )

Afterwards, my app's queries (using readPref=nearest) correctly return the same results.
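The interim workaround described above (periodically refreshing every mongos) can be sketched as a small scheduling loop. This is a minimal, driver-agnostic sketch, not the reporter's actual code: the refresh callable is injected, and with pymongo it could be something like `lambda host: MongoClient(host).admin.command("flushRouterConfig")` (flushRouterConfig forces a mongos to reload its routing table); a dummy primary-read-preference query per sharded collection would work the same way.

```python
import time


def refresh_all_mongos(mongos_hosts, refresh_fn):
    """Run one refresh pass: call refresh_fn against every mongos and
    collect per-host errors instead of aborting the whole sweep."""
    failures = {}
    for host in mongos_hosts:
        try:
            refresh_fn(host)
        except Exception as exc:  # keep sweeping the remaining hosts
            failures[host] = exc
    return failures


def refresh_loop(mongos_hosts, refresh_fn, interval_s=60.0, passes=None):
    """Repeat the sweep every interval_s seconds; passes=None runs forever."""
    done = 0
    while passes is None or done < passes:
        refresh_all_mongos(mongos_hosts, refresh_fn)
        done += 1
        if passes is None or done < passes:
            time.sleep(interval_s)
```

Chunk migrations are expected to be rare after initial balancing, so a generous interval keeps the extra load negligible; collecting failures per host lets one unreachable mongos not block refreshing the others.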

Top User Comments

mark.agarunov commented on Thu, 17 Aug 2017 19:15:21 +0000: Hello skelly, As this behavior seems to be due to the same underlying issue as SERVER-28948, I've closed this ticket as a duplicate. Please follow SERVER-28948 for updates on this issue. Thanks, Mark

schwerin commented on Sat, 12 Aug 2017 14:51:51 +0000: I think that's your best choice today. Disable the balancer if your data naturally has an even distribution, maybe.

skelly commented on Fri, 11 Aug 2017 18:53:36 +0000: Do any of you have any recommendation for a workaround in the meantime? I was thinking I could run a separate thread that makes a simple query against each mongos using the primary read preference at some reasonable interval. Thoughts on that approach?

skelly commented on Fri, 11 Aug 2017 18:49:10 +0000: Great, thanks a lot guys for the quick response. I'm hoping I can develop a reasonable workaround until the feature comes along. I expect chunk migrations will be rare in my deployment anyway, at least after the initial balancing of the data.

schwerin commented on Fri, 11 Aug 2017 16:56:16 +0000: OK. skelly, this is a duplicate of a feature request that we've been developing for the 3.6 release. It's sufficiently complicated that it cannot be backported, I'm afraid, but should be available later this year. dianna.hohensee, can you help mark.agarunov out by selecting an appropriate ticket that is duplicated by this one, so Seth can track this work if he wants?

dianna.hohensee commented on Fri, 11 Aug 2017 13:35:28 +0000: Yes, it will be resolved by our safe secondary reads project in v3.6. Secondaries do not currently (v3.4 or earlier) use routing information to filter results, and in v3.6 they will. To fully resolve his problem, he will likely need to use after-cluster-time reads (also a v3.6 feature) in order to ensure secondaries are not lagging behind their primaries, in case a mongos that is used to do the secondary reads has a stale shardVersion.
schwerin commented on Fri, 11 Aug 2017 12:21:10 +0000: I believe the safe secondary reads project will resolve this. dianna.hohensee?
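The "after cluster time reads" dianna.hohensee mentions refers to MongoDB 3.6 causal consistency: a read can carry a `readConcern.afterClusterTime` value so a secondary will not answer with data older than that cluster time. Drivers manage this automatically inside a causally consistent session; the helper below is purely hypothetical and only illustrates the shape of the resulting find command (the cluster time is shown as a plain placeholder value rather than a real BSON Timestamp):

```python
def find_after_cluster_time(collection_name, query, cluster_time):
    """Build a 'find' command document that asks the server not to return
    data older than cluster_time (MongoDB 3.6+ read-concern semantics).
    Illustrative only; real drivers attach afterClusterTime for you when
    a session is opened with causal consistency enabled."""
    return {
        "find": collection_name,
        "filter": query,
        "readConcern": {
            "level": "majority",
            "afterClusterTime": cluster_time,
        },
    }
```

Combined with the v3.6 behavior of secondaries filtering results by routing information, this is what closes the stale-read window the reporter hit.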


Status

Closed
