Operational Defect Database

BugZero found this defect 2433 days ago.

MongoDB | 399222

[SERVER-29918] stemming behavior for diacritics causes incorrect results

Last update date:

10/27/2023

Affected products:

MongoDB Server

Affected releases:

3.4.4

Fixed releases:

No fixed releases provided.

Description:

Info

$text search is not diacritic-insensitive if the word contains a dieresis ( ¨ ). The dieresis is categorized as a diacritic in the Unicode 8.0 Character Database property list (see http://www.unicode.org/Public/8.0.0/ucd/PropList.txt). Search with collation works fine with strength = 1.
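As a rough illustration of what a strength-1 (primary-level) comparison means, the sketch below uses JavaScript's built-in Intl.Collator rather than MongoDB itself. The "en" locale and the helper name sameAtPrimaryStrength are assumptions made for this example. The option sensitivity: "base" is the closest ECMAScript analogue to primary strength: it compares base letters only and ignores accents and case.

```javascript
// Analogy only: this is ECMAScript's Intl.Collator, not MongoDB's collation engine.
// sensitivity "base" treats strings as equal when only accents or case differ.
const primaryCollator = new Intl.Collator("en", { sensitivity: "base" });

// Hypothetical helper name, introduced for this example.
function sameAtPrimaryStrength(a, b) {
  return primaryCollator.compare(a, b) === 0;
}

// "\u00eb" is e with dieresis, "\u00e9" is e with acute accent.
console.log(sameAtPrimaryStrength("iphone", "iphon\u00eb"));
console.log(sameAtPrimaryStrength("iphone", "iphon\u00e9"));
console.log(sameAtPrimaryStrength("iphone", "iphona"));
```

The first two comparisons should report the strings as equal and the last as different, which matches the reporter's observation that a collation-based search finds the accented variants.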

Top User Comments

kyle.suarez commented on Thu, 9 Nov 2017 21:49:04 +0000:

I've taken another look at the issue here and thoroughly examined what happens with regard to stemming and diacritic stripping. As Dan mentions, the stemmer must be diacritic-sensitive because diacritics affect stemming, even in languages like English. For example, in English:

    "resume" is stemmed to "resum", as you'd expect. Its conjugated forms "resumed", "resuming", etc. all have the same stem.
    "résumé" is stemmed simply to itself, as it is a noun and has no simpler form.

The current text search engine is written in a way that errs on the side of "correctness". That being said, I am definitely sympathetic to the argument that "résumé" is commonly spelled as "resume" in everyday speech. However, changing the way text search works with regard to stemming and diacritic stripping will require a much larger project and detailed design.

Based on this assessment, I'm going to close this ticket as Works as Designed. For now, truly diacritic-insensitive queries should use either collation or a text index language of "none" (but will lose out on the benefit of stemming).

kyle.suarez commented on Thu, 10 Aug 2017 14:19:46 +0000:

It still seems like something is off here, though, like there is an inconsistent approach to the way we perform diacritic stripping and stemming. In any case, I've stopped investigating this ticket as the Query Team's priority is on 3.6 scheduled features. It does seem worth it, though, for someone to investigate this behavior further once we revisit the tickets on the backlog.

dan@10gen.com commented on Thu, 3 Aug 2017 17:59:37 +0000:

The stemmer is diacritic sensitive and it must be because accents have meaning in some languages. See this comment: https://github.com/mongodb/mongo/blob/r3.5.10/src/mongo/db/fts/fts_unicode_tokenizer.cpp#L96

kyle.suarez commented on Tue, 25 Jul 2017 19:51:26 +0000:

Good point...

I tried

    std::cout << "Stemmed version of iphone: " << s.stem("iphone") << std::endl;
    std::cout << "Stemmed version of iphoné: " << s.stem("iphoné") << std::endl;
    std::cout << "Stemmed version of iphonë: " << s.stem("iphonë") << std::endl;

and got

    Stemmed version of iphone: iphon
    Stemmed version of iphoné: iphoné
    Stemmed version of iphonë: iphonë

I'm putting this ticket into Needs Triage, so that the query team can triage this ticket at the next planning meeting. Whoever picks up this ticket should look at the places where stemming happens and see if we can strip diacritics before it occurs.

thomas.schubert commented on Tue, 25 Jul 2017 19:46:36 +0000:

I'd argue the problem isn't that we stem "iphone" to "iphon" (the stem doesn't have to be a valid root), but that we don't stem "iphoné" to "iphon". If our stemming isn't diacritic insensitive, our queries can't be. Can we change "iphoné" to "iphone" before it reaches the stemmer so it generates the same root?

kyle.suarez commented on Tue, 25 Jul 2017 19:18:58 +0000:

After some investigation, I've found that the problem is not a diacritic problem, but a stemming problem. In your text search, the default language is English. Unfortunately, our vendored third-party stemming library, libstemmer.c, stems the word "iphone" into "iphon" when in English mode. Thus, it cuts off the "e" completely and is not included in the search. When changing the language to "none", stemming does not occur, and I find the results as usual.

    > db.text.find()
    { "_id" : ObjectId("59778fac798c05e256b74092"), "t" : "iphone" }
    { "_id" : ObjectId("59778faf798c05e256b74093"), "t" : "iphoné" }
    { "_id" : ObjectId("59778fb2798c05e256b74094"), "t" : "iphonë" }

    > db.text.find({$text: {$search: "iphone"}})
    { "_id" : ObjectId("59778fac798c05e256b74092"), "t" : "iphone" }

    > db.text.find({$text: {$search: "iphone", $language: "none"}})
    { "_id" : ObjectId("59778ef3798c05e256b74086"), "t" : "iphonë" }
    { "_id" : ObjectId("59778faf798c05e256b74093"), "t" : "iphoné" }
    { "_id" : ObjectId("59778fb2798c05e256b74094"), "t" : "iphonë" }

ian@10gen.com commented on Fri, 14 Jul 2017 14:43:28 +0000:

Reminder: kyle.suarez please review to see if you can find the underlying cause.

thomas.schubert commented on Thu, 29 Jun 2017 15:22:57 +0000:

Hi felix2626,

Thank you for reporting this issue. I've marked this ticket to be scheduled against currently planned work. Please continue to watch this ticket for updates.

Kind regards,
Thomas
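The ordering question raised in the comments (strip diacritics before stemming, or stem first) can be sketched in a few lines of JavaScript. This is a minimal illustration, not MongoDB's tokenizer: stripDiacritics is a helper name chosen for this example, and toyStem is a hypothetical stand-in that only drops a trailing "e" to mimic the stems reported in the ticket. It is not the Snowball algorithm used by libstemmer.

```javascript
// Decompose to NFD, then remove combining marks (Unicode category M).
function stripDiacritics(word) {
  return word.normalize("NFD").replace(/\p{M}/gu, "");
}

// Hypothetical toy stemmer: drops one trailing ASCII "e" and nothing else.
// It only imitates the outputs quoted in the ticket ("iphone" -> "iphon").
function toyStem(word) {
  return word.endsWith("e") ? word.slice(0, -1) : word;
}

// "\u00e9" is e with acute accent, "\u00eb" is e with dieresis.
const sampleWords = ["iphone", "iphon\u00e9", "iphon\u00eb"];

// Stemming the raw words leaves the accented forms untouched,
// so the three words end up with three different index terms.
const stemOnly = sampleWords.map(toyStem);

// Stripping first gives all three words the same stem.
const stripThenStem = sampleWords.map((w) => toyStem(stripDiacritics(w)));

console.log(stemOnly);
console.log(stripThenStem);

// The trade-off noted in the thread: the noun "r\u00e9sum\u00e9" collapses
// into the same stem as the verb "resume" once diacritics are stripped first.
console.log(toyStem(stripDiacritics("r\u00e9sum\u00e9")));
```

Stripping first makes the three "iphone" variants match each other, but it also merges "résumé" with "resume", which is the loss of precision the stemmer's diacritic sensitivity is meant to avoid.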

Status

Closed
