Operational Defect Database

BugZero found this defect 2500 days ago.

MongoDB | 375771

[SERVER-28877] Cancel race in replication executor can cause elections to assert.

Last update date:

10/30/2023

Affected products:

MongoDB Server

Affected releases:

3.4.3

Fixed releases:

3.2.15

3.4.6

3.5.8

Description:

Info

Output from mongodb.log: 2017-04-19T13:15:23.807Z I COMMAND [conn50682] command local.oplog.rs command: getMore { getMore: 16914237176, collection: "oplog.rs", maxTimeMS: 5000, term: 161, lastKnownCommittedOpTime: { ts: Timestamp 1492607700000|2, t: 161 } } originatingCommand: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1492156061000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 161 } planSummary: COLLSCAN cursorid:16914237176 keysExamined:0 docsExamined:0 numYields:1 nreturned:0 reslen:292 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_command 6625ms 2017-04-19T13:15:46.612Z I COMMAND [conn51155] command local.oplog.rs command: getMore { getMore: 16914237176, collection: "oplog.rs", maxTimeMS: 5000, term: 161, lastKnownCommittedOpTime: { ts: Timestamp 1492607722000|1, t: 161 } } originatingCommand: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1492156061000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 161 } planSummary: COLLSCAN cursorid:16914237176 keysExamined:0 docsExamined:0 numYields:0 nreturned:0 reslen:292 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } protocol:op_command 17700ms 2017-04-19T13:15:46.614Z I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms 2017-04-19T13:15:46.614Z I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected 2017-04-19T13:15:46.614Z I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mdbfh:27017 2017-04-19T13:15:46.614Z I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mdb1:27017 2017-04-19T13:15:46.614Z I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mdbarb:27017 2017-04-19T13:15:46.656Z I NETWORK [conn51178] received client metadata from 37.120.167.74:33775 conn51178: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.2" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } } 2017-04-19T13:15:46.670Z I NETWORK [conn51179] received client metadata from 84.39.96.207:38332 conn51179: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.3" }, os: { type: "Linux", n ame: "PRETTY_NAME="Debian GNU/Linux 8 (jessie)"", architecture: "x86_64", version: "Kernel 2.6.32-042stab120.18" } } 2017-04-19T13:15:46.740Z I REPL [ReplicationExecutor] Error in heartbeat request to mdb1:27017; ExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 12407ms, timeout was set to 10000ms 2017-04-19T13:15:46.742Z I REPL [ReplicationExecutor] Error in heartbeat request to mdbfh:27017; ExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 12532ms, timeout was set to 10000ms 2017-04-19T13:15:46.742Z I REPL [ReplicationExecutor] Error in heartbeat request to mdbarb:27017; ExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 12533ms, timeout was set to 10000ms 2017-04-19T13:15:46.938Z I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mdb1:27017 2017-04-19T13:15:46.942Z I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mdbfh:27017 2017-04-19T13:15:46.942Z I REPL [ReplicationExecutor] VoteRequester(term 161 dry run) received a no vote from mdb1:27017 with reason "candidate's data is staler than mine"; response message: { term: 161, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 } 2017-04-19T13:15:46.963Z I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mdbarb:27017 2017-04-19T13:15:46.975Z I REPL [ReplicationExecutor] VoteRequester(term 161 dry run) received a no vote from mdbarb:27017 with reason "can see a healthy primary of equal or greater priority"; response message: { term: 161, voteGranted: false, reason: "can see a healthy primary of equal or greater priority", ok: 1.0 } 2017-04-19T13:15:46.996Z I - [ReplicationExecutor] Invariant failure !_callbacks.empty() src/mongo/db/repl/scatter_gather_runner.cpp 149 2017-04-19T13:15:46.997Z I - [ReplicationExecutor]   ***aborting after invariant() failure   2017-04-19T13:15:47.344Z F - [ReplicationExecutor] Got signal: 6 (Aborted).   0x7f12b5fb2981 0x7f12b5fb1a79 0x7f12b5fb1f5d 0x7f12b362f330 0x7f12b328dc37 0x7f12b3291028 0x7f12b5243b9c 0x7f12b5a6b654 0x7f12b5a31ba1 0x7f12b5a2e719 0x7f12b5a36784 0x7f12b6a24510 0x7f12b3627184 0x7f12b3354bed ----- BEGIN BACKTRACE ----- {"backtrace":[{"b":"7F12B4A34000","o":"157E981","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7F12B4A34000","o":"157DA79"},{"b":"7F12B4A34000","o":"157DF5D"},{"b":"7F12B361F000","o":"10330"},{"b":"7F12B3257000","o":"36C37","s":"gsignal"},{"b":"7F12B3257000","o":"3A028","s":"abort"},{"b":"7F12B4A34000","o":"80FB9C","s":"_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j"},{"b":"7F12B4A34000","o":"1037654","s":"_ZN5mongo4repl19ScatterGatherRunner10RunnerImpl15processResponseERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsE"},{"b":"7F12B4A34000","o":"FFDBA1"},{"b":"7F12B4A34000","o":"FFA719"},{"b":"7F12B4A34000","o":"1002784","s":"_ZN5mongo4repl19ReplicationExecutor3runEv"},{"b":"7F12B4A34000","o":"1FF0510"},{"b":"7F12B361F000","o":"8184"},{"b":"7F12B3257000","o":"FDBED","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.3", "gitVersion" : "f07437fb5a6cca07c10bafa78365456eb1d6d5e1", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.13.0-112-generic", "version" : "#159-Ubuntu SMP Fri Mar 3 15:26:07 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "b" : "7F12B4A34000", "elfType" : 3, "buildId" : "CE26BD3A6C680DD87D137892B07C2EEFC1162144" }, { "b" : "7FFFC5A9B000", "elfType" : 3, "buildId" : "012E1338BA43AF7C0DC7D069F64F0A6490CC6D9C" }, { "b" : "7F12B45A7000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "58087C35580286708FF6469B51886E5FF5740101" }, { "b" : "7F12B4165000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "1C949DBD43FCE50C9B691B7EA8F31229891B2218" }, { "b" : "7F12B3F5D000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "1EEBA762A6A2C8884D56033EE8CCE79B95CD974D" }, { "b" : "7F12B3D59000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D0F881E59FF88BE4F29A228C8657376B3C325C2C" }, { "b" : "7F12B3A53000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1654CB13B1D24ED03F4BDCB51FC7524B9181A771" }, { "b" : "7F12B383D000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "36311B4457710AE5578C4BF00791DED7359DBB92" }, { "b" : "7F12B361F000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "22F9078CFA529CCE1A814A4A1A1C018F169D5652" }, { "b" : "7F12B3257000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "CA5C6CFE528AF541C3C2C15CEE4B3C74DA4E2FB4" }, { "b" : "7F12B480F000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "237E22E5AAC2DDFCD06518F63FD720FE758E6E5B" } ] }} mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7f12b5fb2981] mongod(+0x157DA79) [0x7f12b5fb1a79] mongod(+0x157DF5D) [0x7f12b5fb1f5d] libpthread.so.0(+0x10330) [0x7f12b362f330] libc.so.6(gsignal+0x37) [0x7f12b328dc37] libc.so.6(abort+0x148) [0x7f12b3291028] mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0x0) [0x7f12b5243b9c] mongod(_ZN5mongo4repl19ScatterGatherRunner10RunnerImpl15processResponseERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsE+0x254) [0x7f12b5a6b654] mongod(+0xFFDBA1) [0x7f12b5a31ba1] mongod(+0xFFA719) [0x7f12b5a2e719] mongod(_ZN5mongo4repl19ReplicationExecutor3runEv+0x274) [0x7f12b5a36784] mongod(+0x1FF0510) [0x7f12b6a24510] libpthread.so.0(+0x8184) [0x7f12b3627184] libc.so.6(clone+0x6D) [0x7f12b3354bed] ----- END BACKTRACE ----- My Replica Setup: 4 Member: 1 Arbiter with default prio 1 3 "Worker" with prio 1, 2, 2

Top User Comments

xgen-internal-githook commented on Thu, 15 Jun 2017 16:30:50 +0000: Author: {u'username': u'mtrussotto', u'name': u'Matthew Russotto', u'email': u'matthew.russotto@10gen.com'} Message: SERVER-28877 Fix cancel race which can cause crashes during elections. (cherry picked from commit 7844ffc9be25e16daa9615949941b8749ad3de9f) Branch: v3.2 https://github.com/mongodb/mongo/commit/daa09e3bfa5382cfc51de3f339a486cfc8aed219 xgen-internal-githook commented on Wed, 14 Jun 2017 15:58:01 +0000: Author: {u'username': u'mtrussotto', u'name': u'Matthew Russotto', u'email': u'matthew.russotto@10gen.com'} Message: SERVER-28877 Fix cancel race which can cause crashes during elections. (cherry picked from commit 7844ffc9be25e16daa9615949941b8749ad3de9f) Branch: v3.4 https://github.com/mongodb/mongo/commit/ca14d6a53664d228d0115be55a33f5fcd11d7fd2 xgen-internal-githook commented on Thu, 18 May 2017 17:05:28 +0000: Author: {u'username': u'mtrussotto', u'name': u'Matthew Russotto', u'email': u'matthew.russotto@10gen.com'} Message: SERVER-28877 Fix cancel race which can cause crashes during elections. Branch: master https://github.com/mongodb/mongo/commit/7844ffc9be25e16daa9615949941b8749ad3de9f bodenhaltung commented on Thu, 20 Apr 2017 20:37:17 +0000: Sure, please see attached the rs.status() output. spencer commented on Thu, 20 Apr 2017 18:16:02 +0000: Hi Bodenhaltung, We are looking through the logs and code to determine if we can identify the source of the crash. In the meantime, could you please attach the replica set config in use to aid in our diagnostics?. This can be obtained by running rs.status() in the shell when connected to one of the nodes in the set. Thanks, -Spencer bodenhaltung commented on Thu, 20 Apr 2017 09:42:57 +0000: One member runs on "Debian GNU/Linux 8.7", but all with the same mongod Version.

Additional Resources / Links

Share:

BugZero Risk Score

Coming soon

Status

Closed

Have you been affected by this bug?

cost-cta-background

Do you know how much operational outages are costing you?

Understand the cost to your business and how BugZero can help you reduce those costs.

Discussion

Login to read and write comments.

Have you ever...

had your data corrupted from a

VMware

bug?

Search:

...