Operational Defect Database

BugZero found this defect 2493 days ago.

MongoDB | 377944

[SERVER-28993] Duplicate key error returned after write concern timeout instead of TimeoutException

Last update date:

7/29/2017

Affected products:

MongoDB Server

Affected releases:

3.4.2

Fixed releases:

No fixed releases provided.

Description:

Info

Setup: a replica set with two data-bearing nodes and an arbiter, with one data node down. When performing a w:"majority" write that would otherwise fail with a duplicate key error, I expect it to fail with a write concern timeout error after the timeout has expired. However, it still produces the duplicate key error after the wait:

replset:PRIMARY> db.test.insertOne({_id:1})
{ "acknowledged" : true, "insertedId" : 1 }
replset:PRIMARY> db.test.insertOne({_id:1},{writeConcern:{w:"majority",wtimeout:5000}})
... WAIT ...
2017-04-26T18:09:57.685-0400 E QUERY [thread1] WriteError: E11000 duplicate key error collection: test.test index: _id_ dup key: { : 1.0 } :
WriteError({
  "index" : 0,
  "code" : 11000,
  "errmsg" : "E11000 duplicate key error collection: test.test index: _id_ dup key: { : 1.0 }",
  "op" : { "_id" : 1 }
})
WriteError@src/mongo/shell/bulk_api.js:469:48
Bulk/mergeBatchResults@src/mongo/shell/bulk_api.js:836:49
Bulk/executeBatch@src/mongo/shell/bulk_api.js:906:13
Bulk/this.execute@src/mongo/shell/bulk_api.js:1150:21
DBCollection.prototype.insertOne@src/mongo/shell/crud_api.js:242:9
@(shell):1:1

replset:PRIMARY> rs.status()
{
  "set" : "replset",
  "date" : ISODate("2017-04-26T22:10:05.449Z"),
  "myState" : 1,
  "term" : NumberLong(3),
  "heartbeatIntervalMillis" : NumberLong(2000),
  "optimes" : {
    "lastCommittedOpTime" : { "ts" : Timestamp(1493242117, 1), "t" : NumberLong(1) },
    "appliedOpTime" : { "ts" : Timestamp(1493244598, 1), "t" : NumberLong(3) },
    "durableOpTime" : { "ts" : Timestamp(1493244598, 1), "t" : NumberLong(3) }
  },
  "members" : [
    {
      "_id" : 0,
      "name" : "AD-MAC10G.local:27017",
      "health" : 0,
      "state" : 8,
      "stateStr" : "(not reachable/healthy)",
      "uptime" : 0,
      "optime" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) },
      "optimeDurable" : { "ts" : Timestamp(0, 0), "t" : NumberLong(-1) },
      "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
      "optimeDurableDate" : ISODate("1970-01-01T00:00:00Z"),
      "lastHeartbeat" : ISODate("2017-04-26T22:10:04.105Z"),
      "lastHeartbeatRecv" : ISODate("2017-04-26T21:28:42.092Z"),
      "pingMs" : NumberLong(0),
      "lastHeartbeatMessage" : "Connection refused",
      "configVersion" : -1
    },
    {
      "_id" : 1,
      "name" : "AD-MAC10G.local:27018",
      "health" : 1,
      "state" : 1,
      "stateStr" : "PRIMARY",
      "uptime" : 2740,
      "optime" : { "ts" : Timestamp(1493244598, 1), "t" : NumberLong(3) },
      "optimeDate" : ISODate("2017-04-26T22:09:58Z"),
      "infoMessage" : "could not find member to sync from",
      "electionTime" : Timestamp(1493244557, 1),
      "electionDate" : ISODate("2017-04-26T22:09:17Z"),
      "configVersion" : 1,
      "self" : true
    },
    {
      "_id" : 2,
      "name" : "AD-MAC10G.local:27019",
      "health" : 1,
      "state" : 7,
      "stateStr" : "ARBITER",
      "uptime" : 57,
      "lastHeartbeat" : ISODate("2017-04-26T22:10:03.982Z"),
      "lastHeartbeatRecv" : ISODate("2017-04-26T22:10:02.801Z"),
      "pingMs" : NumberLong(0),
      "configVersion" : 1
    }
  ],
  "ok" : 1
}

This makes it impossible to tell that the write concern was not satisfied.
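A client can still distinguish the two failure modes by inspecting the raw command reply rather than relying on the exception the shell throws: the server reports the duplicate-key failure under writeErrors and the replication timeout under writeConcernError, and both can be present in one reply. A minimal sketch of that inspection in plain JavaScript (the helper name and the standalone harness are illustrative; the field names and error codes match the transcripts above):

```javascript
// Classify an insert-command reply of the shape shown in this report.
// writeErrors[].code 11000   -> duplicate key error
// writeConcernError.code 64  -> WriteConcernFailed (errInfo.wtimeout set on timeout)
function classifyWriteReply(reply) {
  const outcomes = [];
  for (const e of reply.writeErrors || []) {
    if (e.code === 11000) outcomes.push("duplicate-key");
  }
  const wce = reply.writeConcernError;
  if (wce && wce.code === 64 && wce.errInfo && wce.errInfo.wtimeout) {
    outcomes.push("wc-timeout");
  }
  return outcomes;
}

// Sample reply modeled on the transcripts above, where both error
// kinds are reported together after the wtimeout expires.
const sampleReply = {
  ok: 1,
  writeErrors: [
    { index: 0, code: 11000, errmsg: "E11000 duplicate key error" }
  ],
  writeConcernError: {
    code: 64,
    codeName: "WriteConcernFailed",
    errInfo: { wtimeout: true },
    errmsg: "waiting for replication timed out"
  }
};

console.log(classifyWriteReply(sampleReply)); // -> [ 'duplicate-key', 'wc-timeout' ]
```

Seeing both outcomes at once tells the caller that the write itself failed and that, in addition, the write concern could not be confirmed within wtimeout.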

Top User Comments

spencer commented on Mon, 3 Jul 2017 21:48:03 +0000:
Looks like the issue was actually with the shell behavior, not the command behavior.

spencer commented on Wed, 17 May 2017 22:49:01 +0000:
Actually, I am unable to reproduce this outside of the shell bulk API. alex.komyagin, can you reproduce this using the command API directly? i.e.
db.runCommand({insert:'test', documents: [{_id:2}], writeConcern:{w:3, wtimeout:3000}, ordered: true})

spencer commented on Wed, 17 May 2017 21:51:17 +0000:
So, hilariously, the difference between ordered and unordered is coming from the shell, not the server. I filed SERVER-29263 for that bug. There may still be a bug in the server where we don't return the write concern error in some places. I'm still investigating the server side of the issue.

alex.komyagin@10gen.com commented on Thu, 27 Apr 2017 20:45:13 +0000:
Just to add to this, ordered and unordered bulk operations are not consistent in reporting write concern failures after a timeout:

replset:PRIMARY> db.test1.bulkWrite([{insertOne:{document:{_id:1}}}],{ordered:false,writeConcern:{w:2,wtimeout:1000}})
2017-04-27T16:38:17.558-0400 E QUERY [thread1] BulkWriteError: write error at item 0 and problem enforcing write concern in bulk operation :
BulkWriteError({
  "writeErrors" : [
    {
      "index" : 0,
      "code" : 11000,
      "errmsg" : "E11000 duplicate key error collection: test.test1 index: _id_ dup key: { : 1.0 }",
      "op" : { "_id" : 1 }
    }
  ],
  "writeConcernErrors" : [
    {
      "code" : 64,
      "codeName" : "WriteConcernFailed",
      "errInfo" : { "wtimeout" : true },
      "errmsg" : "waiting for replication timed out"
    }
  ],
  "nInserted" : 0,
  "nUpserted" : 0,
  "nMatched" : 0,
  "nModified" : 0,
  "nRemoved" : 0,
  "upserted" : [ ]
})
BulkWriteError@src/mongo/shell/bulk_api.js:372:48
BulkWriteResult/this.toError@src/mongo/shell/bulk_api.js:336:24
Bulk/this.execute@src/mongo/shell/bulk_api.js:1173:1
DBCollection.prototype.bulkWrite@src/mongo/shell/crud_api.js:191:20
@(shell):1:1

replset:PRIMARY> db.test1.bulkWrite([{insertOne:{document:{_id:1}}}],{ordered:true,writeConcern:{w:2,wtimeout:1000}})
2017-04-27T16:38:39.568-0400 E QUERY [thread1] BulkWriteError: write error at item 0 in bulk operation :
BulkWriteError({
  "writeErrors" : [
    {
      "index" : 0,
      "code" : 11000,
      "errmsg" : "E11000 duplicate key error collection: test.test1 index: _id_ dup key: { : 1.0 }",
      "op" : { "_id" : 1 }
    }
  ],
  "writeConcernErrors" : [ ],
  "nInserted" : 0,
  "nUpserted" : 0,
  "nMatched" : 0,
  "nModified" : 0,
  "nRemoved" : 0,
  "upserted" : [ ]
})

milkie commented on Wed, 26 Apr 2017 23:58:31 +0000:
Arguably, however, you should still be able to know if your timeout expired, because otherwise you would assume that your read could see the data even when using read concern majority. This seems like a bug that will be fixed if we merge wtimeout with maxTimeMS.

milkie commented on Wed, 26 Apr 2017 23:54:10 +0000:
This behavior is by design. You were always destined to get the duplicate key error, regardless of your write concern. However, because you used w:majority, the server has to delay its response to you so that the preexisting data that caused your error can first replicate to a majority of nodes. This is to allow you to immediately issue a read concern level majority read and be able to see the duplicate data that caused the write error.
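Given the ordered/unordered shell discrepancy above (tracked as SERVER-29263), client code should not treat an empty writeConcernErrors array on an ordered bulk result as proof that replication succeeded. A hedged sketch of a defensive check in plain JavaScript (the helper name, its ordered parameter, and the sample literal are illustrative, modeled on the BulkWriteError shapes in the transcripts):

```javascript
// Summarize a BulkWriteError-style result defensively. Because ordered
// mode may under-report write concern failures (SERVER-29263), an empty
// writeConcernErrors array in ordered mode is flagged as "unknown"
// rather than read as success.
function summarizeBulkResult(res, ordered) {
  const wcErrors = res.writeConcernErrors || [];
  return {
    dupKeys: (res.writeErrors || []).filter(e => e.code === 11000).length,
    wcTimedOut: wcErrors.some(
      e => e.code === 64 && e.errInfo && e.errInfo.wtimeout
    ),
    wcUnknown: ordered === true && wcErrors.length === 0
  };
}

// Sample modeled on the unordered transcript above.
const unorderedResult = {
  writeErrors: [{ index: 0, code: 11000, errmsg: "E11000 duplicate key error" }],
  writeConcernErrors: [
    { code: 64, codeName: "WriteConcernFailed", errInfo: { wtimeout: true } }
  ]
};

console.log(summarizeBulkResult(unorderedResult, false));
// -> { dupKeys: 1, wcTimedOut: true, wcUnknown: false }
```

With this shape, the ordered transcript above (writeConcernErrors empty) would yield wcTimedOut: false but wcUnknown: true, signaling that the application must verify replication some other way before trusting a subsequent majority read.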


Status

Closed
