4/25/2024
MongoDB Server
No affected releases provided.
No fixed releases provided.
Change streams use a number of internal events, such as migrateLastChunkFromShard and migrateChunkToNewShard, to detect cluster-level topology changes. These events are bubbled up to mongoS and consumed by an internal stage; they are not surfaced to the client. Under certain circumstances, however, the resume token of such an internal event may become the postBatchResumeToken (PBRT) of the next batch returned to the client. If the user attempts to resume from that token, the DSCSHandleTopologyChange stage again swallows the event before DSCSEnsureResumeTokenPresent has a chance to see it. The latter stage then concludes that the event did not appear in the resumed stream, and ChangeStreamFatalError is thrown to the client.
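The failure described above can be illustrated with a toy model. This is a minimal sketch, not the server's implementation: resume tokens are modelled as plain integers, the two functions are hypothetical stand-ins for the stages named in the ticket, and the assumption that the resume-point event itself is not returned again is part of the model rather than something the ticket states.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# Internal event names taken from the ticket text.
INTERNAL_OPS = {"migrateLastChunkFromShard", "migrateChunkToNewShard"}


class ChangeStreamFatalError(Exception):
    """Toy stand-in for the server error of the same name."""


@dataclass(frozen=True)
class Event:
    token: int  # hypothetical integer resume token; real tokens are opaque
    op: str


def handle_topology_change(events: Iterable[Event]) -> Iterator[Event]:
    """Toy stand-in for the topology stage: swallow internal events."""
    for event in events:
        if event.op not in INTERNAL_OPS:
            yield event


def ensure_resume_token_present(
    events: Iterable[Event], resume_token: int
) -> Iterator[Event]:
    """Toy stand-in: the resume token's event must appear before any later one."""
    found = False
    for event in events:
        if found:
            yield event
        elif event.token == resume_token:
            found = True  # assumption: the resume point is not returned again
        elif event.token > resume_token:
            raise ChangeStreamFatalError(
                f"resume token {resume_token} not found in resumed stream"
            )


def resume_current_order(oplog: List[Event], resume_token: int) -> List[Event]:
    """Resume with the topology stage placed before the resume-token check."""
    resumed = (e for e in oplog if e.token >= resume_token)
    filtered = handle_topology_change(resumed)
    return list(ensure_resume_token_present(filtered, resume_token))


oplog = [
    Event(1, "insert"),
    Event(2, "migrateChunkToNewShard"),  # internal event, never shown to clients
    Event(3, "insert"),
]

try:
    resume_current_order(oplog, resume_token=2)
    outcome = "resumed"
except ChangeStreamFatalError as exc:
    outcome = f"fatal: {exc}"
print(outcome)
```

In this model, resuming from token 1 (a client-facing event) succeeds, while resuming from token 2 (the internal event) fails because the event is filtered out before the check can observe it.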
bernard.gorman commented on Thu, 25 Apr 2024 21:01:54 +0000:

An initial attempt to repro this issue failed, but in theory it would manifest under either of the following conditions:

- The internal event is the last event in its shard's oplog at the moment the getMore running on that shard returns the event to mongoS, so that the internal event's resume token becomes the shard's minPromisedSortKey.
- The internal event sorts immediately before a client-facing event on the same shard or another shard, and the client-facing event ends up being stashed due to exceeding the maximum mongoS batch size.

In both cases, the mongoS batch's PBRT cannot advance beyond the resume token of the internal event, despite the fact that the event itself is not returned.

The simplest fix for this issue, once it can be reproduced, would likely be to swap the positions of DSCSEnsureResumeTokenPresent and DSCSHandleTopologyChange in the pipeline. Placing DSCSEnsureResumeTokenPresent first would ensure that the internal event is correctly observed in the resumed pipeline in cases where the user resumes from the internal event's PBRT. The main reason for these stages' current placement is so that DSCSHandleTopologyChange has access to the DSMergeCursors stage which immediately precedes it, but it would be fairly straightforward to refactor.
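The proposed reordering can be sketched in the same spirit. This is a simplified model under stated assumptions, not the server's pipeline: events are hypothetical (token, op) tuples with integer tokens, both functions are illustrative stand-ins for the stages named above, and the token check here raises whenever the token is absent, which is stricter than a real stream that may simply wait for more events.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (hypothetical integer resume token, operation name)

# Internal event names taken from the ticket text.
INTERNAL_OPS = {"migrateLastChunkFromShard", "migrateChunkToNewShard"}


def ensure_resume_token_present(events: List[Event], resume_token: int) -> List[Event]:
    """Toy check: the resume token must be visible; return what follows it."""
    if resume_token not in [token for token, _ in events]:
        raise LookupError(f"resume token {resume_token} not found")
    return [e for e in events if e[0] > resume_token]


def handle_topology_change(events: List[Event]) -> List[Event]:
    """Toy topology stage: drop internal events so clients never see them."""
    return [e for e in events if e[1] not in INTERNAL_OPS]


def resume_proposed_order(oplog: List[Event], resume_token: int) -> List[Event]:
    """Run the resume-token check first, then filter internal events."""
    resumed = [e for e in oplog if e[0] >= resume_token]
    return handle_topology_change(ensure_resume_token_present(resumed, resume_token))


oplog = [
    (1, "insert"),
    (2, "migrateChunkToNewShard"),     # internal
    (3, "insert"),
    (4, "migrateLastChunkFromShard"),  # internal
    (5, "update"),
]

# Resuming from the internal event's token now succeeds, and later internal
# events are still hidden from the client.
result = resume_proposed_order(oplog, resume_token=2)
print(result)
```

With the check placed first, the internal event at token 2 is observed before it is filtered, so the resume succeeds while the internal event at token 4 is still withheld from the client.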