Operational Defect Database

BugZero found this defect 24 days ago.

MongoDB | 2653519

Investigate potential leak of internal resume tokens via PBRT

Last update date:

4/25/2024

Affected products:

MongoDB Server

Affected releases:

No affected releases provided.

Fixed releases:

No fixed releases provided.

Description:

Info

Change streams use a number of internal events, such as migrateLastChunkFromShard and migrateChunkToNewShard, to detect cluster-level topology changes. These events are bubbled up to mongoS and consumed by an internal stage; they are not surfaced to the client. However, it is conceivable that, under certain circumstances, the resume token of such an internal event may become the post-batch resume token (PBRT) of the next batch returned to the client. If the user attempts to resume from that token, the DSCSHandleTopologyChange stage would again swallow the event before DSCSEnsureResumeTokenPresent has a chance to see it. The latter stage would then conclude that the event did not appear in the resumed stream, and ChangeStreamFatalError would be thrown to the client.
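The failure on resume can be illustrated with a toy model of the two stages. This is only a sketch, not server code: resume tokens are reduced to integers, the stages are plain Python generators, and the names handle_topology_change, ensure_resume_token_present and resume are hypothetical stand-ins. The reversed stage order is included purely for comparison.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List

# Internal event types named in the description; all other ops are client-facing.
INTERNAL_OPS = {"migrateLastChunkFromShard", "migrateChunkToNewShard"}


class ChangeStreamFatalError(Exception):
    """Stand-in for the error the server reports to the client."""


@dataclass(frozen=True)
class Event:
    token: int  # assumption: an integer replaces the real, opaque resume token
    op: str


Stage = Callable[[Iterable[Event]], Iterator[Event]]


def handle_topology_change(events: Iterable[Event]) -> Iterator[Event]:
    """Toy DSCSHandleTopologyChange: consume internal events, pass the rest on."""
    for event in events:
        if event.op not in INTERNAL_OPS:
            yield event


def ensure_resume_token_present(resume_token: int) -> Stage:
    """Toy DSCSEnsureResumeTokenPresent: the resumed stream must contain the token."""

    def stage(events: Iterable[Event]) -> Iterator[Event]:
        found = False
        for event in events:
            if found:
                yield event
            elif event.token == resume_token:
                found = True  # the resume point itself is not returned again
            elif event.token > resume_token:
                raise ChangeStreamFatalError(
                    f"resume token {resume_token} not found in the stream"
                )

    return stage


def resume(events: List[Event], stages: List[Stage]) -> List[Event]:
    """Chain the stages in the given order and drain the resulting stream."""
    stream: Iterable[Event] = events
    for stage in stages:
        stream = stage(stream)
    return list(stream)


OPLOG = [Event(1, "insert"), Event(2, "migrateChunkToNewShard"), Event(3, "insert")]
LEAKED_TOKEN = 2  # the internal event's token, handed to the client as the PBRT

if __name__ == "__main__":
    check = ensure_resume_token_present(LEAKED_TOKEN)
    try:
        resume(OPLOG, [handle_topology_change, check])
    except ChangeStreamFatalError as exc:
        print("order described above:", exc)
    print("reversed order:", resume(OPLOG, [check, handle_topology_change]))
```

In this model the first ordering raises because the check only ever sees tokens 1 and 3, while the reversed ordering finds token 2 and returns the insert that follows it.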

Top User Comments

bernard.gorman commented on Thu, 25 Apr 2024 21:01:54 +0000:

An initial attempt to repro this issue failed, but in theory it would manifest under either of the following conditions:

1. The internal event is the last event in its shard's oplog at the moment the getMore running on that shard returns the event to mongoS, so that the internal event's resume token becomes the shard's minPromisedSortKey.

2. The internal event sorts immediately before a client-facing event on the same shard or another shard, and the client-facing event ends up being stashed because it would exceed the maximum mongoS batch size.

In both cases, the mongoS batch's PBRT cannot advance beyond the resume token of the internal event, even though the event itself is not returned.

The simplest fix for this issue, once it can be reproduced, would likely be to swap the positions of DSCSEnsureResumeTokenPresent and DSCSHandleTopologyChange in the pipeline. Placing DSCSEnsureResumeTokenPresent first would ensure that the internal event is correctly observed in the resumed pipeline when the user resumes from the internal event's PBRT. The main reason for these stages' current placement is that DSCSHandleTopologyChange needs access to the DSMergeCursors stage which immediately precedes it, but this would be fairly straightforward to refactor.
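The two conditions from the comment can be sketched with a simplified model of how a mongoS batch and its PBRT might be assembled. Everything here is an assumption made for illustration: resume tokens are integers, the batch limit counts events rather than bytes, and ShardResponse and build_batch are hypothetical names, not server APIs.

```python
from dataclasses import dataclass
from typing import List, Tuple

INTERNAL_OPS = {"migrateLastChunkFromShard", "migrateChunkToNewShard"}


@dataclass(frozen=True)
class Event:
    token: int  # assumption: an integer replaces the real, opaque resume token
    op: str


@dataclass(frozen=True)
class ShardResponse:
    events: List[Event]
    min_promised_sort_key: int  # the shard promises no later event sorts below this


def build_batch(shards: List[ShardResponse], max_batch_events: int) -> Tuple[List[Event], int]:
    """Return (client batch, PBRT) for one merged batch in this toy model."""
    if max_batch_events < 1:
        raise ValueError("max_batch_events must be at least 1")
    # Only events at or below every shard's promise can be ordered safely.
    high_water = min(shard.min_promised_sort_key for shard in shards)
    ready = sorted(
        (e for shard in shards for e in shard.events if e.token <= high_water),
        key=lambda e: e.token,
    )
    batch: List[Event] = []
    pbrt = high_water
    for event in ready:
        if event.op in INTERNAL_OPS:
            pbrt = event.token  # consumed internally, yet the PBRT still lands on it
            continue
        if len(batch) == max_batch_events:
            return batch, pbrt  # event is stashed; the PBRT stays at the previous token
        batch.append(event)
        pbrt = event.token
    return batch, high_water  # everything ready was drained


# Condition 1: the internal event is the last event in its shard's oplog.
case_1 = build_batch(
    [
        ShardResponse([Event(1, "insert"), Event(2, "migrateChunkToNewShard")], 2),
        ShardResponse([], 5),
    ],
    max_batch_events=10,
)

# Condition 2: the next client-facing event is stashed because the batch is full.
case_2 = build_batch(
    [
        ShardResponse(
            [Event(1, "insert"), Event(2, "migrateChunkToNewShard"), Event(3, "insert")], 3
        ),
        ShardResponse([], 9),
    ],
    max_batch_events=1,
)
```

In both modelled cases the client receives only the insert with token 1, while the PBRT is 2, the token of the internal event that was never returned.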

Status

Backlog
