Operational Defect Database

BugZero found this defect 2478 days ago.

MongoDB | 383479

[SERVER-29217] Requests stuck when dirty cache reaches 20%

Last update date:

11/15/2021

Affected products:

MongoDB Server

Affected releases:

3.2.10

Fixed releases:

No fixed releases provided.

Description:

Info

mongostat data is shown below:

insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
*0 *0 *0 *0 0 12|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 984b 27k 0 mgset-2080137 PRI 2017-05-15T15:21:29+08:00
*0 *0 *0 *0 0 3|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 191b 22k 0 mgset-2080137 PRI 2017-05-15T15:21:30+08:00
*0 *0 *0 *0 0 5|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 530b 24k 0 mgset-2080137 PRI 2017-05-15T15:21:31+08:00
*0 *0 *0 *0 0 3|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 192b 22k 0 mgset-2080137 PRI 2017-05-15T15:21:32+08:00
*0 *0 *0 *0 0 5|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 532b 23k 0 mgset-2080137 PRI 2017-05-15T15:21:33+08:00
*0 *0 *0 *0 0 8|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 706b 24k 0 mgset-2080137 PRI 2017-05-15T15:21:34+08:00
*0 *0 *0 *0 0 6|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 648b 25k 0 mgset-2080137 PRI 2017-05-15T15:21:35+08:00
*0 *0 *0 *0 0 25|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 2k 33k 0 mgset-2080137 PRI 2017-05-15T15:21:36+08:00
*0 *0 *0 *0 0 4|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 527b 23k 0 mgset-2080137 PRI 2017-05-15T15:21:37+08:00
*0 *0 *0 *0 0 2|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 133b 22k 0 mgset-2080137 PRI 2017-05-15T15:21:38+08:00
*0 *0 *0 *0 0 8|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 698b 25k 0 mgset-2080137 PRI 2017-05-15T15:21:39+08:00
*0 *0 *0 *0 0 4|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 354b 23k 0 mgset-2080137 PRI 2017-05-15T15:21:40+08:00
*0 *0 *0 *0 0 11|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 1k 27k 0 mgset-2080137 PRI 2017-05-15T15:21:41+08:00
*0 *0 *0 *0 0 2|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 134b 22k 0 mgset-2080137 PRI 2017-05-15T15:21:42+08:00
*0 *0 *0 *0 0 5|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 532b 23k 0 mgset-2080137 PRI 2017-05-15T15:21:43+08:00
*0 *0 *0 *0 0 3|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 189b 22k 0 mgset-2080137 PRI 2017-05-15T15:21:44+08:00
*0 *0 *0 *0 0 4|0 20.1 79.6 0 24.9G 16.3G 1105|34 128|128 474b 23k 0 mgset-2080137 PRI 2017-05-15T15:21:45+08:00

Through the pstack result, I found that __wt_evict_thread_run never did any work even though the dirty cache ratio had reached 20%. After I changed eviction_dirty_trigger to 30, the server immediately recovered to a normal state:

db.runCommand({setParameter: 1, wiredTigerEngineRuntimeConfig: "eviction_dirty_trigger=30"})
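For reference, the value mongostat reports as "% dirty" can be computed directly from serverStatus, and the runtime workaround used in this report can be issued from the mongo shell. The following is a minimal sketch, not part of the original report: the statistic names come from the standard WiredTiger cache section of serverStatus, and the 20% check and the new trigger value of 30 are simply the numbers mentioned above; adjust them for your own deployment.

// Sketch: compute the dirty-cache ratio and, if it is pinned at the 20%
// trigger, apply the same runtime workaround the reporter used.
var cache = db.serverStatus().wiredTiger.cache;
var dirtyBytes = cache["tracked dirty bytes in the cache"];
var maxBytes = cache["maximum bytes configured"];
var dirtyRatio = dirtyBytes / maxBytes;
print("dirty cache ratio: " + (dirtyRatio * 100).toFixed(1) + "%");

if (dirtyRatio >= 0.20) {
    // Raise eviction_dirty_trigger at runtime; this takes effect immediately
    // and does not persist across a mongod restart.
    db.adminCommand({
        setParameter: 1,
        wiredTigerEngineRuntimeConfig: "eviction_dirty_trigger=30"
    });
}

Note that setParameter must run against the admin database, which is why the sketch uses db.adminCommand rather than db.runCommand.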

Top User Comments

thomas.schubert commented on Mon, 12 Jun 2017 16:57:04 +0000: Hi zyd_com, An examination of the diagnostic.data you've uploaded indicates that this is a custom, unsupported build. Unfortunately, since this build contains modifications to the source code, we cannot debug it. If you're able to reproduce this issue using a standard build, we'll be able to continue to investigate. Thank you, Thomas

michael.cahill commented on Tue, 16 May 2017 06:28:02 +0000: zyd_com, I've looked through the data you uploaded. This node was in fact running MongoDB 3.2.9, and was a secondary. I don't see large numbers of queued operations: I assume they were on the primary? Looking at the eviction statistics, I think it would be worth trying a more recent release of MongoDB such as 3.2.13, which includes many improvements in this area. Is it possible for you to test 3.2.13, even if that is on a hidden secondary? michael_liuxin, you are correct that WiredTiger's eviction code relies on an eviction server thread to keep the queues full so that other threads can make progress. If something blocks the eviction server thread, that could cause the behavior reported here where operations stall. This is one of the areas we have improved in recent releases of MongoDB, which is why I think it would be good to test with the latest version to see whether this issue has been resolved.

michael_liuxin commented on Tue, 16 May 2017 05:41:03 +0000: @Thomas Schubert I hit a stall analogous to @Youdong Zhang's, and I reproduced it with a smaller cacheSize setting in wiredtiger-2.8.1 as shipped in MongoDB v3.2.10. Digging into the evict_server/evict_worker code, I have some clues; I'm not sure they will help, but they may be useful.
1) In the pstack.rtf file, many __evict_page calls from txn_open (application threads) or __evict_lru_pages (evict workers) block on a mutex, and the others are in __wt_reconcile() and __rec_row_leaf(). I also see many threads waiting in __wt_cache_eviction_worker() on conn->evict_threads.wait_cond, which means the application threads are waiting for the evict_server's __evict_lru_walk to wake them up.
2) In my scenario, GDB breaks at __evict_lru_pages() and shows that the evict_worker invokes __evict_page(), and __wt_reconcile() inside it always returns EBUSY, so the evict_worker keeps retrying. I cannot find an evict_server at all (an evict_worker does not compete to become the evict_server). In evict_lru.c:861:
    /*
     * Reconcile and discard some pages: EBUSY is returned if a page fails
     * eviction because it's unavailable, continue in that case.
     */
    while (F_ISSET(S2C(session), WT_CONN_EVICTION_RUN) && ret == 0)
        if ((ret = __evict_page(session, is_server)) == EBUSY)
            ret = 0;
3) If there is no evict_server to wake up the application threads, an application thread times out and retries (evict_lru.c:1797), so the application's request appears blocked. One question remains in my scenario: why does __wt_reconcile() -> __rec_row_leaf_insert() -> __rec_txn_read() return EBUSY? I found only two statements that would do that:
    if (!F_ISSET(r, WT_EVICT_LOOKASIDE | WT_EVICT_UPDATE_RESTORE))
        return (EBUSY);
    if (skipped && !F_ISSET(r, WT_EVICT_UPDATE_RESTORE))
        return (EBUSY);
This problem may be similar to @Youdong Zhang's, I guess, but I am not very familiar with the code inside __rec_txn_read(), so I hope for your help. Thanks

zyd_com@126.com commented on Tue, 16 May 2017 03:16:33 +0000: @Thomas Schubert I have uploaded the diagnostic.data file.

thomas.schubert commented on Mon, 15 May 2017 19:31:16 +0000: Hi zyd_com, So we can investigate, would you please provide an archive of the diagnostic.data and complete log files for the affected mongod? I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time. Thank you, Thomas
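To complement the analysis in the comments above (application threads waiting on the eviction server, workers retrying on EBUSY), one way to see from the outside whether eviction is making progress is to diff the eviction-related statistics in serverStatus over a short interval. The sketch below is illustrative only and not from the original report: the statistic names are taken from serverStatus().wiredTiger.cache, but the exact set differs between WiredTiger versions, so any name a given build does not report is simply skipped.

// Sketch: sample WiredTiger eviction statistics twice, ten seconds apart,
// and print the deltas. Growing application-thread eviction with a flat
// eviction-server count suggests the server thread is not doing its share.
function evictionSnapshot() {
    var c = db.serverStatus().wiredTiger.cache;
    return {
        // gauge: its "delta" shows whether dirty data is growing or shrinking
        "tracked dirty bytes in the cache": c["tracked dirty bytes in the cache"],
        "pages evicted by application threads": c["pages evicted by application threads"],
        "eviction server evicting pages": c["eviction server evicting pages"]
    };
}

var before = evictionSnapshot();
sleep(10 * 1000);  // sample interval: 10 seconds
var after = evictionSnapshot();

for (var key in before) {
    if (before[key] === undefined || after[key] === undefined) {
        print(key + ": not reported by this WiredTiger version");
        continue;
    }
    print(key + ": delta over 10s = " + (after[key] - before[key]));
}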

Status

Closed
