Efficiently replace all accented characters in a string?
For a poor man's implementation of near-collation-correct sorting on the client side I need a JavaScript function that does efficient single character replacement in a string.
Here is what I mean (note that this applies to German text, other languages sort differently):
native sorting gets it wrong: a b c o u z ä ö ü
collation-correct would be: a ä b c o ö u ü z
Basically, I need all occurrences of "ä" of a given string replaced with "a" (and so on). This way the result of native sorting would be very close to what a user would expect (or what a database would return).
Other languages have facilities to do just that: Python supplies str.translate(), in Perl there is tr/…/…/, XPath has a function translate(), ColdFusion has ReplaceList(). But what about JavaScript?
Here is what I have right now.
// s would be a rather short string (something like
// 200 characters at max, most of the time much less)
function makeSortString(s) {
  var translate = {
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U" // probably more to come
  };
  var translate_re = /[öäüÖÄÜ]/g;
  return ( s.replace(translate_re, function(match) {
    return translate[match];
  }) );
}
For starters, I don't like the fact that the regex is rebuilt every time I call the function. I guess a closure can help in this regard, but I don't seem to get the hang of it for some reason.
Can someone think of something more efficient?
Answers below fall in two categories:
- String replacement functions of varying degrees of completeness and efficiency (what I was originally asking about)
- A late mention of String#localeCompare, which is now widely supported among JS engines (not so much at the time of the question) and could solve this category of problem much more elegantly.
Here is a more complete version based on the Unicode standard.
Some examples:
https://stackoverflow.com/a/37511463
I can't speak to what you are trying to do specifically with the function itself, but if you don't like the regex being built every time, here are two solutions and some caveats about each.
Here is one way to do this:
This will obviously make the regex a property of the function itself. The only thing you may not like about this (or you may, I guess it depends) is that the regex can now be modified outside of the function's body. So, someone could do this to modify the internally used regex:
So, there is that option.
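The code for this first option did not survive in this copy of the thread. A minimal sketch of what the paragraphs above describe, reusing the names from the question (the property name `translate_re` on the function is an assumption, not necessarily what the answer used), might look like this:

```javascript
// Sketch only: the regex is stored as a property of the function,
// so it is created once rather than on every call.
function makeSortString(s) {
  var translate = {
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U"
  };
  return s.replace(makeSortString.translate_re, function (match) {
    return translate[match];
  });
}
makeSortString.translate_re = /[öäüÖÄÜ]/g;

// The caveat: code outside the function can replace the regex, for example
// (hypothetical misuse, left commented out):
// makeSortString.translate_re = /[ä]/g;
```

The upside is simplicity; the downside is exactly the one mentioned above, since the property is publicly writable.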
One way to get a closure, and thus prevent someone from modifying the regex, would be to define this as an anonymous function assignment like this:
Hopefully this is useful to you.
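The closure variant's code is also missing here. A minimal sketch, assuming the same names as in the question; in this sketch the lookup table is kept inside the closure too, so both the regex and the table are built only once:

```javascript
// Sketch only: an immediately invoked function expression (IIFE)
// keeps the regex and the lookup table private to the closure.
var makeSortString = (function () {
  var translate = {
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U"
  };
  var translate_re = /[öäüÖÄÜ]/g;

  // This inner function is what callers actually get.
  return function (s) {
    return s.replace(translate_re, function (match) {
      return translate[match];
    });
  };
})();
```

Unlike the function-property approach, nothing outside the closure can reach `translate_re` or `translate`.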
UPDATE: It's early and I don't know why I didn't see the obvious before, but it might also be useful to put your translate object in a closure as well:
The correct terminology for such accents is diacritics. After Googling this term, I found this function, which is part of backbone.paginator. It has a very complete collection of diacritics and replaces them with their most intuitive ASCII character. I found this to be the most complete JavaScript solution available today.
The full function for future reference:
Simply normalize the string and then run a replacement on the result:
See normalize
Then you can use this function:
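The function itself is missing from this copy. A commonly used form of the normalize-then-replace idea, given here as a sketch (the name `removeDiacritics` is hypothetical), relies on `String.prototype.normalize`:

```javascript
// Sketch: decompose each character into base letter + combining marks (NFD),
// then drop the combining marks in the range U+0300 to U+036F.
function removeDiacritics(str) {
  return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

var example = removeDiacritics("Crème Brûlée"); // "Creme Brulee"
```

Note that letters without a canonical decomposition, such as "ß" or "ø", pass through unchanged with this approach.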
I can't think of an easier way to efficiently remove all diacritics from a string than using this amazing solution.
See it in action:
I think this might work a little cleaner/better (though I haven't tested its performance):
Or if you are still too worried about performance, let's get the best of both worlds:
EDIT (by @Tomalak)
I appreciate the idea. However, there are several things wrong with the implementation, as outlined in the comment below.
Here is how I would implement it.
Based on the solution by Jason Bunting, here is what I use now.
The whole thing is for the jQuery tablesorter plug-in: for (nearly correct) sorting of non-English tables with the tablesorter plug-in, it is necessary to make use of a custom textExtraction function.
This one, among other things, changes dates from 'dd.mm.yyyy' to a recognized format ('yyyy-mm-dd').
Be careful to save the JavaScript file in UTF-8 encoding or it won't work.
You can use it like this:
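Neither the function nor the usage snippet is preserved here. As a rough sketch of the idea only, the text-normalizing part could be written as a standalone helper; the name `toSortableText` and the exact date pattern are assumptions, and the tablesorter wiring is shown only as a comment:

```javascript
// Hypothetical helper: turns cell text into a sort-friendly string.
var toSortableText = (function () {
  var translate = {
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U"
  };
  var translate_re = /[öäüÖÄÜ]/g;
  var date_re = /^(\d\d)\.(\d\d)\.(\d{4})$/; // dd.mm.yyyy

  return function (text) {
    var s = text.trim();
    var m = date_re.exec(s);
    if (m) {
      // Reorder to yyyy-mm-dd so that plain string sorting is chronological.
      return m[3] + "-" + m[2] + "-" + m[1];
    }
    return s.replace(translate_re, function (match) {
      return translate[match];
    });
  };
})();

// Assumed wiring (requires jQuery and the tablesorter plug-in in a browser):
// $("table").tablesorter({
//   textExtraction: function (node) { return toSortableText($(node).text()); }
// });
```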
The complete solution to your request is:
If you're looking specifically for a way to convert accented characters to non-accented characters, rather than a way to sort accented characters, with a little finagling, the String.localeCompare function can be manipulated to find the basic latin characters that match the extended ones. For example, you might want to produce a human friendly url slug from a page title. If so, you can do something like this:
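The original implementation is not included in this copy. A much simplified sketch of the underlying trick, without the options the answer goes on to describe, could look like the following; the name `toBaseLatin`, the "en" locale and the linear scan over the alphabet are all assumptions, and the engine needs Intl support for the `sensitivity` option:

```javascript
var BASE_LETTERS = "abcdefghijklmnopqrstuvwxyz";

// Sketch: for each non-ASCII character, look for a basic Latin letter that
// compares as equal when accents and case are ignored.
function toBaseLatin(str) {
  return Array.from(str, function (ch) {
    if (ch.charCodeAt(0) < 128) {
      return ch; // plain ASCII is left alone
    }
    for (var i = 0; i < BASE_LETTERS.length; i++) {
      var base = BASE_LETTERS[i];
      if (ch.localeCompare(base, "en", { sensitivity: "base" }) === 0) {
        // Preserve the case of the original character.
        return ch !== ch.toLowerCase() ? base.toUpperCase() : base;
      }
    }
    return ch; // no matching base letter found
  }).join("");
}
```

Characters that do not compare equal to any single base letter are returned unchanged in this sketch.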
This should perform quite well, but if further optimization were needed, a binary search could be used with localeCompare as the comparator to locate the base character. Note that case is preserved, and options allow for either preserving, replacing, or removing characters that aren't alphabetical, or do not have matching Latin characters they can be replaced with. This implementation is faster and more flexible, and should work with new characters as they are added. The disadvantage is that compound characters like 'ꝡ' have to be handled specifically, if they need to be supported.
I made a Prototype Version of this:
Use like:
This will change the string to a_o_u_A_O_U_ss
Based on existing answers and some suggestions, I've created this one:
It uses real characters instead of a Unicode list and works well.
You can use it like
You can easily convert this function so that it is not on the String prototype. However, as I'm a fan of using the String prototype in such cases, you'll have to do it yourself.
A direct port to JavaScript of Kieron's solution: https://github.com/rwarasaurus/nano/blob/master/system/helpers.php#L61-73:
And a slightly modified version, using a char-map instead of two arrays:
To compare these two methods I made a simple benchmark: http://jsperf.com/replace-foreign-characters
Not a single answer mentions String.localeCompare, which happens to do exactly what you originally wanted, but not what you're asking for.
The second and third parameters are not supported by older browsers, though. It's an option worth considering nonetheless.
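The example that accompanied this answer is missing from this copy. A minimal sketch using the letters from the question, assuming an engine with Intl support for the locale argument:

```javascript
var letters = ["a", "b", "c", "o", "u", "z", "ä", "ö", "ü"];

// The default sort compares UTF-16 code units, so the umlauts end up last.
var nativeOrder = letters.slice().sort();
// a b c o u z ä ö ü

// localeCompare with a German locale; "de" is the second parameter
// mentioned above.
var germanOrder = letters.slice().sort(function (a, b) {
  return a.localeCompare(b, "de");
});
// a ä b c o ö u ü z
```

No characters are replaced here; the comparison itself is collation-aware, which is why this addresses the sorting problem rather than the string replacement that was asked about.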
I just wanted to post my solution using String#localeCompare
https://lodash.com/docs/4.17.15#deburr
It took me a while to find this, hope it'll help somebody else too.
A long time ago I did this in Java and found someone else's solution based on a single string that captures the part of the Unicode table that was important for the conversion; the rest was converted to ? or any other replacement character. So I tried to convert it to JavaScript. Mind that I'm no JS expert. :-)
This converts most Latin-1 and Latin-2 Unicode characters. It cannot translate a single character into multiple characters. I don't know how it performs in JS; in Java it is by far the fastest of the common solutions (6-50x): there is no map, no regex, nothing. It produces strict ASCII output, potentially with a loss of information, but the size of the output matches the input.
I tested the snippet with http://www.webtoolkitonline.com/javascript-tester.html and it produced
Supa, co? lstczyaoa??
as expected.
If you want to achieve sorting where "ä" comes after "a" and is not treated as the same, then you can use a function like mine.
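The function this answer refers to is not included in this copy. A minimal sketch of the custom-alphabet idea, assuming a lowercase German alphabet string and a hypothetical comparator name `compareByAlphabet`:

```javascript
// The position of a letter in this string defines its sort order.
var ALPHABET = "aäbcdefghijklmnoöpqrsßtuüvwxyz";

function compareByAlphabet(a, b) {
  a = a.toLowerCase();
  b = b.toLowerCase();
  var len = Math.min(a.length, b.length);
  for (var i = 0; i < len; i++) {
    // Characters missing from ALPHABET get index -1, so they sort first
    // and are not distinguished from each other.
    var ia = ALPHABET.indexOf(a.charAt(i));
    var ib = ALPHABET.indexOf(b.charAt(i));
    if (ia !== ib) {
      return ia - ib;
    }
  }
  // Equal up to the shorter length: the shorter string comes first.
  return a.length - b.length;
}

var sorted = ["z", "ä", "a", "o", "ö", "b"].sort(compareByAlphabet);
// a ä b o ö z
```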
You can always change the alphabet to get different or even weird sortings. However, if you want some letters to be equivalent, then you have to manipulate the strings like a = a.replace(/ä/, 'a') or similar, as many have already replied above. I've included the uppercase letters if someone wants to have all uppercase words before all lowercase words (then you have to omit .toLowerCase()).
A simple and easy way:
So do this:
Output:
Crisalin's answer is almost perfect. I just improved the performance by avoiding the creation of a new RegExp on each run.
Usage:
I've solved it another way, if you like.
Here I used two arrays: searchChars contains the characters that will be replaced, and replaceChars contains the desired replacement characters.
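The code for this answer is missing here. A minimal sketch of the two-array approach, keeping the array names from the description; the function name `replaceAccents` and the short character list are assumptions:

```javascript
// Parallel arrays: searchChars[i] is replaced by replaceChars[i].
var searchChars  = ["ä", "ö", "ü", "Ä", "Ö", "Ü"];
var replaceChars = ["a", "o", "u", "A", "O", "U"];

function replaceAccents(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    var ch = str.charAt(i);
    var idx = searchChars.indexOf(ch);
    // Characters that are not listed are copied through unchanged.
    out += idx === -1 ? ch : replaceChars[idx];
  }
  return out;
}
```

The two arrays have to be kept in the same order and at the same length, which is the main thing a single lookup object avoids.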
For the lads using TypeScript and those who don't want to deal with string prototypes, here is a TypeScript version of Ed.'s answer:
If you want the fastest way of solving this problem, you can use the npm package latinize: https://www.npmjs.com/package/latinize