Operational Defect Database

BugZero found this defect 2504 days ago.

MongoDB | 375396

[SERVER-28857] Strange election on network failure

Last update date:

10/27/2023

Affected products:

MongoDB Server

Affected releases:

3.4.3

Fixed releases:

No fixed releases provided.

Description:

Info

The subject replica set has 3 nodes (see rs.conf() below): t1 at 10.3.1.12, t2 at 10.3.1.13, and t3 at 10.3.1.16. After a transient network failure (switch ports were disabled and re-enabled) on the secondary (t3), it became primary, causing rollbacks on the previous primary (t1) and the other secondary (t2). All writes are done with w:majority, so this is really strange. Logs from all three machines are attached.

rs.conf()

{
    "_id" : "driveFS-temp-1",
    "version" : 4,
    "protocolVersion" : NumberLong(1),
    "writeConcernMajorityJournalDefault" : false,
    "members" : [
        {
            "_id" : 0,
            "host" : "t1.s1.fs.drive.bru:27231",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "t2.s1.fs.drive.bru:27231",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "t3.s1.fs.drive.bru:27231",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 5000,
        "catchUpTimeoutMillis" : 2000,
        "getLastErrorModes" : { },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        },
        "replicaSetId" : ObjectId("58c9657b40aba377920b23f2")
    }
}
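For reference, a w:majority write against a set like this looks like the mongo shell sketch below; the "drive.files" namespace is hypothetical, not taken from the report. The point that matters for the comments that follow is that the guarantee only covers writes whose majority acknowledgment actually reached the application.

// A minimal sketch with a hypothetical "drive.files" namespace:
// the insert is acknowledged only once a majority of the set
// (2 of the 3 nodes here) has the write, or wtimeout elapses.
db = db.getSiblingDB("drive")
var res = db.files.insert(
    { _id: 1, name: "example" },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
)
// The result is only meaningful if the call returned at all; a
// network error during the call leaves the outcome unknown.
printjson(res)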

Top User Comments

onyxmaster commented on Mon, 1 May 2017 15:47:10 +0000:
Thank you for the information. I was more surprised that the election allowed a secondary to be elected as primary while the primary was available and connected to the other secondary. Well, since this preserves acknowledged majority writes, it's okay.

thomas.schubert commented on Mon, 1 May 2017 02:37:17 +0000:
Hi onyxmaster, after reviewing the logs, there is no indication of a bug during this failover. While writes acknowledged with w: majority will not be rolled back, writes issued with this write concern that have not yet been acknowledged to the application are liable to be rolled back on failover. In this case, it appears that the writes were completed on the secondary, but the rest of the replica set (and the application by extension) was not yet aware that these writes had been completed. Consequently, the secondary and old primary rolled back on failover. Kind regards, Thomas

thomas.schubert commented on Wed, 19 Apr 2017 14:25:42 +0000:
Hi onyxmaster, thank you for the detailed report and logs. We're investigating this behavior and will update this ticket after we've finished reviewing the logs. Kind regards, Thomas
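Thomas's distinction — acknowledged majority writes are durable, in-flight ones are not — leaves the application with an "outcome unknown" window whenever a write errors out during failover. A minimal shell sketch of how that window can be handled, assuming a hypothetical files collection with unique _id values (so the retry is idempotent) and that majority read concern is enabled (in 3.4 this requires starting mongod with --enableMajorityReadConcern):

// A sketch of the failure mode described above, with a hypothetical
// "files" collection. If the acknowledgment is lost to a network
// error, verify before retrying instead of assuming the write failed.
try {
    db.files.insert(
        { _id: 42, name: "report.bin" },
        { writeConcern: { w: "majority", wtimeout: 5000 } }
    )
} catch (e) {
    // Outcome unknown: the write may be majority-committed, applied
    // but not yet committed (liable to roll back), or absent.
    var committed = db.files.find({ _id: 42 }).readConcern("majority").itcount()
    if (committed === 0) {
        // The write is not majority-committed; safe to retry the insert.
    }
}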

Status

Closed
