Normalizing/unaccenting text in Java

Posted on 2024-12-13 22:57:59

How can I normalize/unaccent text in Java? I am currently using java.text.Normalizer:

Normalizer.normalize(str, Normalizer.Form.NFD)
    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")

But it is far from perfect. For example, it leaves the Norwegian characters æ and ø untouched. Does anyone know of an alternative? I am looking for something that would convert characters from all sorts of languages to just the a-z range. I realize there are different ways to do this (e.g. should æ be encoded as 'a', 'e' or even 'ae'?) and I'm open to any solution. I would prefer not to write something myself, since I think it's unlikely that I would be able to do this well for all languages. Performance is NOT critical.
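To make the limitation concrete, here is a minimal sketch of the snippet above wrapped in a method. The class name `NormalizerDemo` and the method `stripMarks` are placeholders for illustration, not part of any library. Letters such as ø and æ have no canonical decomposition in Unicode, so NFD leaves them as single code points and the regex finds nothing to strip. The string literals use Unicode escapes for "Bășan" and "Øyvind" so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;

final class NormalizerDemo {
    // Decompose to NFD, then delete everything in the Combining Diacritical Marks block.
    static String stripMarks(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // a-breve and s-comma-below decompose into a base letter plus a combining mark.
        System.out.println(stripMarks("B\u0103\u0219an"));
        // O-with-stroke has no decomposition, so it passes through unchanged.
        System.out.println(stripMarks("\u00D8yvind"));
    }
}
```

The first call returns `Basan`, while the second returns its input unchanged, which is exactly the gap described above.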

The use case: I want to convert a user-entered name to a plain name in the a-z range. The converted name will be displayed to the user, so I want it to match what the user wrote in their original language as closely as possible.

EDIT:

Alright people, thanks for negging the post and not addressing my question, yay! :) Maybe I should have left out the use case. But please allow me to clarify. I need to convert the name in order to store it internally, and I have no control over the choice of letters allowed there. The name will be visible to the user in, for example, the URL, the same way that your user name on this forum is normalized and shown to you in the URL if you click on your name. This forum converts a name like "Bășan" to "baan" and a name like "Øyvind" to "yvind". I believe it can be done better. I am looking for ideas and preferably a library function to do this for me. I know I cannot get it perfectly right, and I know that "o" and "ø" are different, etc., but if my name is "Øyvind" and I register on an online forum, I would likely prefer that my user name be "oyvind" and not "yvind". I hope this makes sense! Thanks!

(And NO, we will not allow users to pick their own user names. I am really just looking for an alternative to java.text.Normalizer. Thanks!)

逆蝶 2024-12-20 22:57:59

Assuming you have considered ALL of the implications of what you're doing, ALL the ways it can go wrong, and what you'll do when you get Chinese characters and other things that have no equivalent in the Latin alphabet...

There's no library that I know of that does what you want. If you have a list of equivalencies (as you say, 'æ' to 'ae' or whatever), you could store them in a file (or, if you're doing this a lot, in a sorted array in memory, for performance reasons) and then do a lookup and replace character by character. If you have the space in memory for an array with one entry per Unicode character, indexing it directly by each character's code value would be the most efficient.

i.e., \u1234 => lookupArray[0x1234] => 'q'

or whatever.

So you'll have a loop that looks like:

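StringBuilder buf = new StringBuilder();
for (int i = 0; i < string.length(); i++) {
  // A Java char is a UTF-16 code unit, so it can index the table directly;
  // lookupArray is assumed to hold 65536 entries, one per possible char value.
  buf.append(lookupArray[string.charAt(i)]);
}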

I wrote that from scratch, so there are probably some bad method calls or something.
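A slightly fuller sketch of the lookup idea, under a few assumptions of mine: it uses a `HashMap` keyed by code point rather than a full array, so that only the exceptions need to be listed and a replacement can be longer than one character ('æ' to "ae"). The class name `LookupTransliterator`, the method `transliterate`, and the five table entries are hypothetical; a real table would need far more entries.

```java
import java.util.HashMap;
import java.util.Map;

final class LookupTransliterator {
    // Hypothetical, deliberately tiny table keyed by Unicode code point.
    private static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        TABLE.put(0x00E6, "ae"); // lowercase ae ligature
        TABLE.put(0x00C6, "AE"); // uppercase ae ligature
        TABLE.put(0x00F8, "o");  // lowercase o with stroke
        TABLE.put(0x00D8, "O");  // uppercase o with stroke
        TABLE.put(0x00DF, "ss"); // sharp s
    }

    // Replace each code point by its table entry; unmapped code points pass through unchanged.
    static String transliterate(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String mapped = TABLE.get(cp);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

Iterating over code points instead of `char` values keeps characters outside the Basic Multilingual Plane intact, at the cost of a hash lookup per character instead of a direct array index.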

You'll have to do something to handle decomposed characters, probably with a lookahead buffer.
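One way to sidestep a lookahead buffer, offered as an assumption rather than as the method intended here: run the input through `java.text.Normalizer` with NFD first, so that precomposed and decomposed spellings arrive in the same form, then keep base letters, consult a small fallback table for letters that do not decompose, and drop everything else. The names `SlugSketch`, `toAscii`, and `FALLBACK` are hypothetical, and the table is only a starting point.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;

final class SlugSketch {
    // Hypothetical fallback table for lowercase letters that NFD does not decompose.
    private static final Map<Character, String> FALLBACK = Map.of(
            '\u00E6', "ae", // ae ligature
            '\u00F8', "o",  // o with stroke
            '\u00DF', "ss", // sharp s
            '\u0111', "d",  // d with stroke
            '\u0142', "l"); // l with stroke

    static String toAscii(String input) {
        // Lowercase first, then decompose so accents become separate combining marks.
        String decomposed = Normalizer.normalize(
                input.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (c >= 'a' && c <= 'z') {
                out.append(c);
            } else if (FALLBACK.containsKey(c)) {
                out.append(FALLBACK.get(c));
            }
            // Anything else (combining marks, digits, unmapped scripts) is dropped.
        }
        return out.toString();
    }
}
```

With this sketch, "Øyvind" becomes `oyvind` and "Bășan" becomes `basan`, whether the accents arrive precomposed or as separate combining marks. Input in a script with no table entries, such as Chinese, comes back as an empty string, so a caller would still need a policy for that case.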

Good luck - I'm sure this is fraught with pitfalls.
