Operational Defect Database

BugZero found this defect 240 days ago.

MongoDB | 2448811

Query operations that avoid going through the network when a shard is targeting only itself should create a fresh operation context

Last update date:

3/13/2024

Affected products:

MongoDB Server

Affected releases:

No affected releases provided.

Fixed releases:

No fixed releases provided.

Description:

Info

Trying to keep this ticket generic, although I believe that most of the related issues / BF tickets are related to the optimization performed in SERVER-77427. Why didn't we see these errors before? Mostly due to SERVER-81007. The FSM framework was swallowing some exceptions. This was addressed in 7.1. Affected versions The bug that caused those errors to be shallowed  (SERVER-79735) was introduced on 7.1.  Thus, I would expect that if any previous version was suffering of these failures at least our testing coverage would have caught them. What I believe is the conflicting commit (SERVER-77427) was introduced on 7.1, which matches with the timeline. About the issue In the query framework we have at least one place in which if we detect that an operation that was going to be sent through the network is just involving the current shard, we end up executing it locally. This local execution is performed without creating a fresh operation context (i.e. reusing the existing one), which could be problematic, specially for components like the OperationShardingState that assumes that a pair of can only be associated with one shard version. Note that this is just an example but we could have other components doing similar assumptions. The problematic interleaving is the following: Router sends an operation with SV1 and nss on a certain shard. Shard start its execution, registering the with SV1. A DDL operation bump the shardVersion of that shard to SV2. The shard continues the execution of the operation, and it reaches a point in which in theory it has to send a request to a set of shards. This set only includes itself, so it ends up performing a direct execution of the command. We try to register the pair with SV2 on the OperationShardingState but because of 2. we end up failing. Update it even looks like we don't really need to have a bump of the shard version (step 3): some kind of query operations, I believe that the ones that use $mergeCursors, initially attach ChunkVersion::IGNORED, so if that operation needs to perform some targeting and end ups deciding to perform a local execution it, it will also hit the same failure (see BF-30079). How should we fix it? I asked about the semantics of command re-entrance to service-arch. They mentioned that ideally we should create a fresh operation context in this situations, to avoid hitting assumptions like the one explained above. Another thing we could consider is reverting SERVER-77427 if it ends up being the root cause of the failures.

Top User Comments

JIRAUSER1270969 commented on Wed, 15 Nov 2023 09:14:39 +0000: Currently the main way to test this is $lookup. However, currently it doesn't support unsplittable collections, so I am blocking this ticket until we do SERVER-79580. It will give us good test coverage for potential solutions. xgen-internal-githook commented on Wed, 27 Sep 2023 10:26:34 +0000: Author: {'name': 'Ivan Fefer', 'email': 'ivan.fefer@mongodb.com', 'username': 'Fefer-Ivan'} Message: SERVER-81335 Fallback to remote read if local read for sub pipeline fails with IllegalChangeToExpectedShardVersion Branch: master https://github.com/mongodb/mongo/commit/b5330fcbc5989e6f7750bba166379ea75d47502f JIRAUSER1270969 commented on Wed, 27 Sep 2023 08:10:02 +0000: Revert is merged to 7.1 branch: https://github.com/mongodb/mongo/commit/aae297113407b219885bfe44f25a1fb377e04557 The ticket remains open, so we can properly fix this issue in master for the future releases. JIRAUSER1274113 commented on Tue, 26 Sep 2023 22:32:17 +0000: Hi ivan.fefer@mongodb.com, What is the actual status of this one? Do you still need to merge the change/revert to the 7.1 branch?

Steps to Reproduce


Additional Resources / Links

Share:

BugZero® Risk Score

What's this?

Coming soon

Status

Open

Learn More

Search:

...