Operational Defect Database

BugZero found this defect 2481 days ago.

MongoDB | 383009

[SERVER-29160] Sharding commonly uses write concern timeouts of 15 seconds and these are timing out in migration related operations and causing BFs

Last update date:

10/30/2023

Affected products:

MongoDB Server

Affected releases:

No affected releases provided.

Fixed releases:

3.6.9

4.0.4

4.1.4

Description:

Info

Sharding's writeConcern timeouts for writes performed throughout the migration process should be raised to prevent BFs. This write specifically caused the linked BF. Any other related writes whose timeouts can be raised without seriously affecting the rest of the system should be bumped as well. The proposal is a bump to 30-second timeouts, rather than the 15-second timeout that is the norm in sharding.

Suggested fix: there are 20 different kMajorityWriteConcern values defined in anonymous namespaces, but most are still connected, so we can add the durations to write_concern_options.h:

    static constexpr Seconds kWriteConcernTimeoutSharding{30};
    static constexpr Seconds kWriteConcernTimeoutMigration{60};
    static constexpr Seconds kWriteConcernTimeoutUserCommand{60};

and then, instead of

    const WriteConcernOptions kMajorityWriteConcern(WriteConcernOptions::kMajority, WriteConcernOptions::SyncMode::UNSET, Seconds(30));

use

    const WriteConcernOptions kMajorityWriteConcern(WriteConcernOptions::kMajority, WriteConcernOptions::SyncMode::UNSET, WriteConcernOptions::kWriteConcernTimeoutSharding);
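The proposal in the description can be tried out in isolation with a minimal, self-contained sketch. This is not the MongoDB source: `Seconds` is aliased to `std::chrono::seconds` and `WriteConcernOptions` is a reduced stand-in with hypothetical members (`wMode`, `syncMode`, `wTimeout`), assumed only for illustration. The constant names and values are the ones given in the description.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>

// Assumption: stand-in for the server's Seconds duration type.
using Seconds = std::chrono::seconds;

// Assumption: reduced stand-in for WriteConcernOptions, not the real class.
struct WriteConcernOptions {
    enum class SyncMode { UNSET, NONE, FSYNC, JOURNAL };

    // Shared timeout constants, named as proposed in the description.
    static constexpr Seconds kWriteConcernTimeoutSharding{30};
    static constexpr Seconds kWriteConcernTimeoutMigration{60};
    static constexpr Seconds kWriteConcernTimeoutUserCommand{60};

    static constexpr const char* kMajority = "majority";

    WriteConcernOptions(std::string mode, SyncMode sync, Seconds timeout)
        : wMode(std::move(mode)), syncMode(sync), wTimeout(timeout) {
        // A negative timeout would be a programming error in this sketch.
        assert(timeout.count() >= 0);
    }

    std::string wMode;   // hypothetical member name
    SyncMode syncMode;   // hypothetical member name
    Seconds wTimeout;    // hypothetical member name
};

// Out-of-class definitions: required before C++17, redundant but allowed after.
constexpr Seconds WriteConcernOptions::kWriteConcernTimeoutSharding;
constexpr Seconds WriteConcernOptions::kWriteConcernTimeoutMigration;
constexpr Seconds WriteConcernOptions::kWriteConcernTimeoutUserCommand;
constexpr const char* WriteConcernOptions::kMajority;

// Each call site names a shared constant instead of a local Seconds(N) literal.
const WriteConcernOptions kMajorityWriteConcern(
    WriteConcernOptions::kMajority,
    WriteConcernOptions::SyncMode::UNSET,
    WriteConcernOptions::kWriteConcernTimeoutSharding);
```

With the durations defined in one place, a later bump touches a single constant rather than each of the per-file kMajorityWriteConcern definitions.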

Top User Comments

xgen-internal-githook commented on Wed, 3 Oct 2018 17:38:40 +0000:
Author: {'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}
Message: SERVER-29160 bump timeout for migration operations to 60sec
Branch: v3.6
https://github.com/mongodb/mongo/commit/5f8dfb3ca2bac43ba65f2fb9907953deb96fba91

xgen-internal-githook commented on Wed, 3 Oct 2018 17:30:44 +0000:
Author: {'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}
Message: SERVER-29160 bump timeout for migration operations
Branch: v4.0
https://github.com/mongodb/mongo/commit/54cf97e5366ad421ed775d6fd74d2aa4fddaed02

xgen-internal-githook commented on Sat, 29 Sep 2018 02:09:48 +0000:
Author: {'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}
Message: SERVER-29160 follow up change wtimeout in unit tests to match
Branch: v3.6
https://github.com/mongodb/mongo/commit/890e5c76917c6bad1e00eb1e5d6dfadfd94db573

xgen-internal-githook commented on Fri, 28 Sep 2018 20:20:47 +0000:
Author: {'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}
Message: SERVER-29160 bump timeout for migration operations (cherry picked from commit 3f618b86df0473ab905cc4a0ad78f4be8d3428e3)
Branch: v3.6
https://github.com/mongodb/mongo/commit/ef72d37036d96ba77f7b528df1b8952441cc66ad

xgen-internal-githook commented on Tue, 25 Sep 2018 00:52:01 +0000:
Author: {'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}
Message: SERVER-29160 bump timeout for migration operations
Branch: master
https://github.com/mongodb/mongo/commit/3f618b86df0473ab905cc4a0ad78f4be8d3428e3

misha.tyulenev commented on Mon, 10 Sep 2018 19:33:30 +0000:
For the default in command.cpp https://github.com/mongodb/mongo/blob/master/src/mongo/db/commands.cpp#L74

esha.maharishi@10gen.com commented on Mon, 10 Sep 2018 19:28:20 +0000:
Looks great! One question, why do we need kWriteConcernTimeoutUserCommand?

misha.tyulenev commented on Mon, 10 Sep 2018 19:26:38 +0000:
esha.maharishi please ack the approach outlined in the description

dianna.hohensee commented on Thu, 12 Oct 2017 16:00:40 +0000:
Linking BF-6834 because it has similar issues, though not migration related commands. It's also a bit odd looking. It shows a writeConcern: { w: "majority", wtimeout: 15000 } 15 second timeout, but takes 39 seconds to complete, and completes after the test fails. Perhaps a config set network timeout of 30 seconds, and then the write on the shard had a 15 second timeout set.

dianna.hohensee commented on Fri, 16 Jun 2017 21:19:01 +0000:
BF-5723's scenario is startCommit timing out after 30 seconds, followed closely by the migrateThread timing out (and failing the migration) after 15 seconds. Consider upping one of those timeouts. However SERVER-29698 is solved might have some effect on those timeouts.

Status

Closed
