Operational Defect Database

BugZero found this defect 306 days ago.

MongoDB | 2394695

Server returns unexpected CollectionUUIDMismatch with actual collection name equal to expected collection name

Last update date:

3/15/2024

Affected products:

MongoDB Server

Affected releases:

7.1.0-rc0

6.0.6

Fixed releases:

No fixed releases provided.

Description:

Info

In a Mongosync test, we encountered a CollectionUUIDMismatch in which the actual collection name was equal to the expected collection name when issuing a delete command on a sharded collection in a 3-shard cluster. The collection UUID of interest is e97a6bd1-498d-4dbf-8477-d77190fb744b for the namespace "testDB.testColl2". A first observation from the mongos logs is that the collection testColl2 consists of 2 chunks, one on shard dst-sh01 and another on dst-sh02, leaving dst-sh03 with no chunks. This was explicitly set up by Mongosync using the "updateZoneKeyRange" command on those two shards with those specific chunks, intentionally omitting dst-sh03.

{"t":{"$date":"2023-07-17T20:36:38.985+00:00"},"s":"D3", "c":"EXECUTOR", "id":22608, "ctx":"ShardRegistry","msg":"Received remote response","attr":{"response":"RemoteOnAnyResponse -- cmd: { cursor: { firstBatch: [ { _id: \"testDB.testColl2\", lastmodEpoch: ObjectId('64b5a6557ce7fa44aa9b9b67'), lastmod: new Date(1689626198046), timestamp: Timestamp(1689626197, 140), uuid: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), key: { _id: 1 }, unique: false, noBalance: false }, { chunks: { _id: ObjectId('64b5a656bbcf3f85c9eb21f7'), uuid: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), min: { _id: MinKey }, max: { _id: 74 }, shard: \"dst-sh01\", lastmod: Timestamp(1, 0), onCurrentShardSince: Timestamp(1689626197, 140), history: [ { validAfter: Timestamp(1689626197, 140), shard: \"dst-sh01\" } ] } }, { chunks: { _id: ObjectId('64b5a656bbcf3f85c9eb21f8'), uuid: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), min: { _id: 74 }, max: { _id: MaxKey }, shard: \"dst-sh02\", lastmod: Timestamp(1, 1), onCurrentShardSince: Timestamp(1689626197, 140), history: [ { validAfter: Timestamp(1689626197, 140), shard: \"dst-sh02\" } ] } } ], id: 0, ns: \"config.collections\", atClusterTime: Timestamp(1689626198, 120) }, ok: 1.0, $clusterTime: { clusterTime: Timestamp(1689626198, 134), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1689626198, 120), $topologyTime: Timestamp(1689626063, 2), operationTime: Timestamp(1689626198, 120) } target: localhost:28121 status: OK elapsedMicros: 2900 μs moreToCome: false"}}

The delete command is shown in this mongos line:

{"t":{"$date":"2023-07-17T20:36:38.986+00:00"},"s":"D4", "c":"ASIO", "id":22596, "ctx":"conn259","msg":"startCommand","attr":{"request":"RemoteCommand 12610 -- target:[localhost:28028] db:testDB expDate:2023-07-17T20:41:38.990+00:00 cmd:{ delete: \"testColl2\", bypassDocumentValidation: false, ordered: false, collectionUUID: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), deletes: [ { q: { $and: [ { $expr: { $gte: [ \"$_id\", { $literal: 0 } ] } }, { $expr: { $lt: [ \"$_id\", { $literal: 99 } ] } } ] }, limit: 0, hint: { _id: 1 } } ], shardVersion: { e: ObjectId('00000000ffffffffffffffff'), t: Timestamp(4294967295, 4294967295), v: Timestamp(0, 0) }, writeConcern: { w: \"majority\", j: true, wtimeout: 120000 }, lsid: { id: UUID(\"6e1d23ea-ac74-484f-ab44-341b08fcbfac\"), uid: BinData(0, FFEE4C8F085C89ED4D839F06F64A1B5513D7CA6A5BC5F3EE7052E1385F4965D3) } }"}}

This command is sent to all three shards, including dst-sh03 at localhost:28028, which doesn't have any testColl2 chunks. The commands sent to dst-sh01 and dst-sh02 return without any errors.
The command sent to dst-sh03 returns with a CollectionUUIDMismatch error:

{"t":{"$date":"2023-07-17T20:36:38.987+00:00"},"s":"D2", "c":"ASIO", "id":22597, "ctx":"conn259","msg":"Request finished with response","attr":{"requestId":12610,"isOK":true,"response":"{ n: 0, electionId: ObjectId('7fffffff0000000000000002'), opTime: { ts: Timestamp(1689626198, 76), t: 2 }, writeErrors: [ { index: 0, code: 361, errmsg: \"Collection UUID does not match that specified\", db: \"testDB\", collectionUUID: UUID(\"e97a6bd1-498d-4dbf-8477-d77190fb744b\"), expectedCollection: \"testColl2\", actualCollection: null } ], ok: 1.0, $clusterTime: { clusterTime: Timestamp(1689626198, 134), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1689626198, 120), $topologyTime: Timestamp(1689626063, 2), operationTime: Timestamp(1689626198, 76) }"}}

The expectedCollection is "testColl2" and the actualCollection is null. However, by the time this response reaches Mongosync the error has become

{\"index\": {\"$numberInt\":\"0\"},\"code\": {\"$numberInt\":\"361\"},\"errmsg\": \"Collection UUID does not match that specified\",\"db\": \"testDB\",\"collectionUUID\": {\"$binary\":{\"base64\":\"6Xpr0UmNTb+Ed9dxkPt0Sw==\",\"subType\":\"04\"}},\"expectedCollection\": \"testColl2\",\"actualCollection\": \"testColl2\"}

where expectedCollection and actualCollection are both "testColl2". Mongosync was not expecting to receive a CollectionUUIDMismatch error at all, since testColl2 exists. Because the returned actual collection name is the same as the expected collection name, the delete command is retried without changing the expected collection name, which results in the same CollectionUUIDMismatch error. In this test, Mongosync retries 5 times before giving up and erroring out. This might be linked to SERVER-76624; it is important for Mongosync that this gets resolved.
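For illustration, the retry pattern that this behavior defeats can be sketched in mongo-shell JavaScript. This is a hypothetical, minimal sketch (Mongosync's actual implementation is in Go and differs); the only server facts assumed are the collectionUUID option on the delete command and error code 361 (CollectionUUIDMismatch), both shown in the logs above:

    // Hypothetical sketch of the client-side retry heuristic described above.
    // On CollectionUUIDMismatch, the caller normally retries against the
    // reported actualCollection (e.g. after a collection rename).
    function deleteWithUUIDCheck(db, collName, uuid, deletes, maxAttempts) {
        for (let attempt = 0; attempt < maxAttempts; attempt++) {
            const res = db.runCommand(
                {delete: collName, collectionUUID: uuid, ordered: false, deletes: deletes});
            const err = (res.writeErrors || []).find(e => e.code === 361);
            if (!err) {
                return res;  // no UUID mismatch, the delete went through
            }
            // In this bug, err.actualCollection equals the expected name
            // ("testColl2"), so every retry reissues an identical command and
            // fails the same way until maxAttempts is exhausted.
            collName = err.actualCollection;
        }
        throw Error("CollectionUUIDMismatch persisted after " + maxAttempts + " attempts");
    }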

Top User Comments

JIRAUSER1264163 commented on Fri, 9 Feb 2024 16:41:29 +0000: The problem here is a somewhat tricky one. In principle, the delete gets broadcast to all shards whenever it would target more than one shard; this behavior dates back roughly ten years, to SERVER-13448. As a result, we refresh the targeter, which tells us not to send a request to the shard that doesn't contain data, but we then override that and send the request to all shards anyway. The shard that owns no data expectedly returns a UUID mismatch error with actualCollection set to null, mongos converts that into the actual collection name, and we end up with the expected == actual error. The correct thing to do would be to investigate whether this condition still makes sense today with the current infrastructure surrounding the shard versioning system.

haley.connelly commented on Thu, 14 Sep 2023 13:36:37 +0000: I spoke with jordi.serra-torrens@mongodb.com and we agree this should be tackled by sharding. I've also attached simpleRepro.js, which reproduces the behavior.

JIRAUSER1272935 commented on Wed, 19 Jul 2023 16:31:13 +0000: gregory.noma@mongodb.com, do you mean that using the collectionUUID option in a write that might broadcast to all shards doesn't make sense? Mongosync must use the collectionUUID option in a delete-many command because otherwise we might delete documents from the wrong collection following a collection rename. We are open to exploring workarounds that don't require the use of delete many, though. Having the server reconcile the CollectionUUIDMismatch internally would be the ideal resolution for Mongosync.

JIRAUSER1257318 commented on Wed, 19 Jul 2023 14:47:39 +0000: "Is there a reason we need to target all shards here, as opposed to only shards which own chunks for the collection?" The problem is that when a router broadcasts writes to shards, it attaches ShardVersion::IGNORED instead of the proper shard version. The reason is two-fold: (i) if the write is non-idempotent, we want to avoid retrying it when one shard throws StaleConfig; (ii) large multi-writes may not converge (i.e., finish before some migration commits and changes the ShardVersion). Since ShardVersion::IGNORED means that shards will not inform the router that its routing table was stale, the router does the most pessimistic thing: it broadcasts the operation to all shards in the cluster, not only those that the router believes have chunks for the collection.

JIRAUSER1272935 commented on Wed, 19 Jul 2023 14:32:28 +0000: Maybe chunk migration is a reason why we potentially need to target all shards, including those that don't own any chunks.

gregory.noma commented on Wed, 19 Jul 2023 13:55:40 +0000: "1. Client executed a delete with expectedUUID, limit=0 and not targeted by exact shard key equality query. Therefore, the router decides it needs to target ALL shards, even those that do not have any chunks (such as sh3)." Hmm, yeah, it seems like collectionUUID wouldn't work correctly in this case. Is there a reason we need to target all shards here, as opposed to only shards which own chunks for the collection?

JIRAUSER1264979 commented on Tue, 18 Jul 2023 17:17:04 +0000: Note, based on our initial investigation, it seems that this behavior may not be self-correcting.

JIRAUSER1257318 commented on Tue, 18 Jul 2023 17:02:56 +0000: The following happened: 1. The client executed a delete with expectedUUID and limit=0, not targeted by an exact shard-key equality query. The router therefore decides it needs to target ALL shards, even those that do not have any chunks (such as sh3). 2. Since sh3 does not own any chunk of the collection, it doesn't even have the collection created locally. When it executes its part of the delete, the expectedUUID check does not find any local collection with that UUID, so it fails with CollectionUUIDMismatch and reports a null collection name. 3. The router catches this error and fills in the collection name by asking the db-primary shard for the namespace associated with the wanted UUID. As a result, the CollectionUUIDMismatch error returned to the client has actualNss == expectedNss.

JIRAUSER1264979 commented on Tue, 18 Jul 2023 15:16:36 +0000: Relinking the task logs here: https://spruce.mongodb.com/task/mongosync_amazon2_arm64_e2e_unlike_sc2_sc3_patch_4b82b7341dbe0852e893d027df22223f50edc875_64b5a54ea4cf478fdde42fc4_23_07_17_20_32_16/tests?execution=0&sortBy=STATUS&sortDir=ASC. The mongod and mongos logs are available in the files tab, with higher verbosity. If repro instructions are needed, they can be provided.
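The attached simpleRepro.js is not reproduced in this ticket text, but based on the sequence described in the comments above, a jstest-style reproduction could look roughly like the sketch below. It is an assumption-laden illustration, not the attached script: it uses split and moveChunk instead of the zone setup used in the original Mongosync test so that chunk placement is deterministic, and it reuses the names testDB and testColl2 from the logs.

    // Hedged repro sketch: three shards, chunks only on the first two, then a
    // broadcast delete with collectionUUID hits the shard that owns no chunks.
    const st = new ShardingTest({shards: 3});
    const mongos = st.s;
    const testDB = mongos.getDB("testDB");

    assert.commandWorked(mongos.adminCommand({enableSharding: "testDB"}));
    assert.commandWorked(mongos.adminCommand({shardCollection: "testDB.testColl2", key: {_id: 1}}));

    for (let i = 0; i < 100; i++) {
        testDB.testColl2.insert({_id: i});
    }

    // Two chunks, placed on shard0 and shard1; shard2 owns nothing for testColl2.
    assert.commandWorked(mongos.adminCommand({split: "testDB.testColl2", middle: {_id: 74}}));
    assert.commandWorked(mongos.adminCommand(
        {moveChunk: "testDB.testColl2", find: {_id: 0}, to: st.shard0.shardName}));
    assert.commandWorked(mongos.adminCommand(
        {moveChunk: "testDB.testColl2", find: {_id: 80}, to: st.shard1.shardName}));

    const uuid = testDB.getCollectionInfos({name: "testColl2"})[0].info.uuid;

    // A multi-delete whose predicate mongos cannot use for targeting, so it is
    // broadcast to all shards (mirroring the $expr query in the logs above).
    const res = testDB.runCommand({
        delete: "testColl2",
        collectionUUID: uuid,
        ordered: false,
        deletes: [{
            q: {$and: [{$expr: {$gte: ["$_id", 0]}}, {$expr: {$lt: ["$_id", 99]}}]},
            limit: 0,
            hint: {_id: 1}
        }]
    });
    // Expected buggy outcome: res.writeErrors[0].code === 361 (CollectionUUIDMismatch)
    // with expectedCollection === actualCollection === "testColl2".
    printjson(res);
    st.stop();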

Steps to Reproduce


Additional Resources / Links


BugZero® Risk Score


Coming soon

Status

In Code Review
