Programmatic accent reduction in JavaScript (aka text normalization or unaccenting)
I need to treat two strings such as these as equal:
Lubeck == Lübeck
In JavaScript.
Why? Well, I have an auto-completion field that's going out to a Java service using Lucene, where place names are stored naturally (as Lübeck), but also indexed as normalized text,
import sun.text.Normalizer;
oDoc.setNameLC = Normalizer.normalize(oLocName, Normalizer.DECOMP, 0)
.toLowerCase().replaceAll("[^\\p{ASCII}]","");
This way someone who doesn't know how to type "Mèxico" can type "mexico" and get a match which returns "Mèxico" (among a lot of other possible hits, like "Café Mèxico, Dubai, UAE").
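For comparison on the client, the same folding can be approximated in JavaScript. This is only a sketch: `foldToAscii` is a hypothetical helper name, it assumes `Normalizer.DECOMP` corresponds to canonical decomposition (NFD), and it relies on `String.prototype.normalize`, which older browsers lack.

```javascript
// Hypothetical helper (not from the original post): fold a string roughly the
// way the server-side index does: decompose, lowercase, drop non-ASCII.
function foldToAscii(text) {
  return text
    .normalize("NFD")               // e.g. "ü" becomes "u" + U+0308
    .toLowerCase()
    .replace(/[^\x00-\x7F]/g, "");  // drop combining marks and any other non-ASCII
}

// The two spellings from the question compare equal once folded.
console.log(foldToAscii("Lübeck") === foldToAscii("Lubeck")); // true
```

Note that characters with no ASCII decomposition (for example "ø" or "ß") are simply dropped by this approach, just as they would be by the `[^\p{ASCII}]` replacement.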
Now the thing is I don't have the ability to change the service to do any highlighting on the server side, therefore I am highlighting on the client JavaScript side with something like:
return result.replace(new RegExp(input.replace(/[aeiou]/g, "."), "i"), "<b>$&</b>");
It's a little more fancy because I am escaping special regex characters in the input. This is fine for simple one word matches at the beginning of a hit, but it really breaks down if you suddenly wish to support multi-word matches like "london cafe":
input = input.strip().toLowerCase(); //fyi prototype's strip is like trim
re = new RegExp(input.replace(/[aeiou]/g,".").replace(/\s+/g,"|"),"gi");
return result.replace(re, "<b>$&</b>"); // $& inserts the matched text
This doesn't work for, say, "london ca" (while typing "london cafe"), because it would mark "Jack London Cabin, Dawson City, Canada" as: "Ja<b>ck</b> <b>London</b> <b>Ca</b>bin, Dawson <b>Ci</b>ty, <b>Ca</b>nada"
[note the "ck" and "Ci" particularly]
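The over-matching can be reproduced with a small standalone version of the wildcard approach. The function names here are made up for illustration, and `trim()` stands in for Prototype's `strip()`.

```javascript
// Escape regex metacharacters in user input before building a pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wildcard highlighting as described above: every vowel becomes ".",
// and whitespace between words becomes an alternation.
function highlightWithWildcards(result, input) {
  const pattern = escapeRegExp(input.trim().toLowerCase())
    .replace(/[aeiou]/g, ".")
    .replace(/\s+/g, "|");
  return result.replace(new RegExp(pattern, "gi"), "<b>$&</b>");
}

console.log(
  highlightWithWildcards("Jack London Cabin, Dawson City, Canada", "london ca")
);
// Ja<b>ck</b> <b>London</b> <b>Ca</b>bin, Dawson <b>Ci</b>ty, <b>Ca</b>nada
```

The pattern built for "london ca" is `l.nd.n|c.`, so the `c.` alternative matches any "c" followed by any character, which is why "ck" and "Ci" get marked.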
Therefore I'm sort of looking for something that's not as crazy as:
input = input.strip().toLowerCase();
input = input.replace(/a/g,"[aÀàÁáÂâÃãÄäÅåÆæĀāĂăĄą]");
input = input.replace(/e/g,"[eÈèÉéÊêËëĒēĔĕĖėĘęĚě]");
// ditto for i, o, u, y, c, n, maybe also d, g, h, j, k, l, r, s, t, w, z
re = new RegExp(input.replace(/\s+/g,"|"),"gi");
return result.replace(re, "<b>$&</b>"); // $& inserts the matched text
Is there a compiled table I can refer to that maps the accented variants of a character to its base character? I don't mean the plain Unicode chart. And if so, could I avoid using weird, possibly slow, RegExp statements?
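One possible way to avoid large character-class patterns is to fold the result string once while remembering where each folded character came from, then search the folded text with plain `indexOf`. The sketch below is an assumption-laden illustration, not an established solution: `foldWithMap` and `highlight` are hypothetical names, the folding relies on `String.prototype.normalize` instead of a hand-built table, and the output is not HTML-escaped.

```javascript
// Fold text to unaccented lowercase and record, for every folded character,
// the index of the original character it came from.
function foldWithMap(text) {
  let folded = "";
  const map = [];
  for (let i = 0; i < text.length; i++) {
    const base = text[i]
      .normalize("NFD")
      .replace(/[\u0300-\u036f]/g, "") // strip combining diacritical marks
      .toLowerCase();
    for (const ch of base) {
      folded += ch;
      map.push(i);
    }
  }
  return { folded, map };
}

// Wrap every occurrence of every typed word in <b>...</b>, matching on the
// folded text but slicing the original (NFC-normalized) text.
function highlight(result, input) {
  const text = result.normalize("NFC");
  const { folded, map } = foldWithMap(text);
  const terms = foldWithMap(input.trim()).folded.split(/\s+/).filter(Boolean);

  const ranges = [];
  for (const term of terms) {
    let from = folded.indexOf(term);
    while (from !== -1) {
      const last = from + term.length - 1;
      ranges.push([map[from], map[last] + 1]);
      from = folded.indexOf(term, from + term.length);
    }
  }
  ranges.sort((a, b) => a[0] - b[0]);

  let out = "";
  let pos = 0;
  for (const [start, end] of ranges) {
    if (start < pos) continue; // skip ranges overlapping an earlier match
    out += text.slice(pos, start) + "<b>" + text.slice(start, end) + "</b>";
    pos = end;
  }
  return out + text.slice(pos);
}

console.log(highlight("Jack London Cabin, Dawson City, Canada", "london ca"));
// Jack <b>London</b> <b>Ca</b>bin, Dawson City, <b>Ca</b>nada
console.log(highlight("Café Mèxico, Dubai, UAE", "mexico"));
// Café <b>Mèxico</b>, Dubai, UAE
```

Because the typed words are matched literally against the folded text, "ck" and "Ci" are no longer marked. Characters that only exist as base letter plus combining mark after NFC normalization are not handled specially here.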
About the bounty:
Before I started a bounty there were two answers: one pointing me to doing it in Ruby, and one that MizzardX wrote, which was a completion of the basic form I'd put in my question. Now don't get me wrong, I really appreciate him working it out as completely as he did, but I just wished that there might be another way. It seems so far that everyone who's dropped by to look at the question and answer has decided that MizzardX covers it just fine, or that they have no different approach. I would be interested in a different approach, and if one simply isn't available before the bounty closes, MizzardX will win the bounty (though in a cruel twist, his edits made it a community wiki answer, so I'm not sure if he'll get the bounty!)
A more complete version with case-sensitive support, ligatures, and whatnot.
Original source at: http://lehelk.com/2011/05/06/script-to-remove-diacritics/
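The script itself is not reproduced here. As a rough illustration of the table-driven idea only, a case-preserving lookup with a couple of ligatures might be sketched like this. The table entries and the names `FOLD_TABLE`, `CHAR_MAP`, and `removeDiacritics` are assumptions made for this example; they are not copied from the linked source, which covers far more characters.

```javascript
// Illustrative, deliberately tiny table: base replacement -> accented variants.
// Upper and lower case are kept separate so the original case is preserved.
const FOLD_TABLE = {
  A: "ÀÁÂÃÄÅĀĂĄ",
  AE: "Æ",
  E: "ÈÉÊËĒĔĖĘĚ",
  O: "ÒÓÔÕÖØŌŎŐ",
  OE: "Œ",
  U: "ÙÚÛÜŨŪŬŮŰŲ",
  a: "àáâãäåāăą",
  ae: "æ",
  e: "èéêëēĕėęě",
  o: "òóôõöøōŏő",
  oe: "œ",
  ss: "ß",
  u: "ùúûüũūŭůűų",
};

// Invert the table into a per-character lookup once, up front.
const CHAR_MAP = new Map();
for (const [base, variants] of Object.entries(FOLD_TABLE)) {
  for (const ch of variants.normalize("NFC")) {
    CHAR_MAP.set(ch, base);
  }
}

// Replace each known accented character or ligature; leave the rest as-is.
function removeDiacritics(text) {
  let out = "";
  for (const ch of text.normalize("NFC")) {
    out += CHAR_MAP.has(ch) ? CHAR_MAP.get(ch) : ch;
  }
  return out;
}

console.log(removeDiacritics("Lübeck"));      // Lubeck
console.log(removeDiacritics("Æsop Straße")); // AEsop Strasse
```

Unlike stripping combining marks after decomposition, a table like this can expand ligatures such as "æ" and "ß", at the cost of having to list every character explicitly.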