Is there an iconv with //TRANSLIT equivalent in Java?
Is there a way to transliterate characters between charsets in Java? Something similar to the Unix command (or the similar PHP function):
iconv -f UTF-8 -t ASCII//TRANSLIT < some_doc.txt > new_doc.txt
Preferably operating on strings, without involving files.
I know you can change encodings with the String constructor, but that doesn't handle transliteration of characters that aren't in the resulting charset.
I'm not aware of any libraries that do exactly what iconv purports to do (which doesn't seem very well defined). However, you can use "normalization" in Java to do things like remove accents from characters. This process is well defined by Unicode standards.
I think NFKD (compatibility decomposition) followed by a filtering of non-ASCII characters might get you close to what you want. Obviously, this is a lossy process; you can never recover all of the information that was in the original string, so be careful.
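A minimal sketch of the NFKD-then-filter idea described above, using only `java.text.Normalizer` from the standard library. The class and method names (`AsciiFold`, `toAscii`) are made up for illustration; this is one possible reading of the approach, not a definitive implementation.

```java
import java.text.Normalizer;

class AsciiFold {
    // Decompose with NFKD, then keep only chars in the 7-bit ASCII range.
    // Combining accents (and anything else outside ASCII) are silently dropped.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (c <= 0x7F) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, `AsciiFold.toAscii("Caf\u00e9")` yields `"Cafe"`, because the accented letter decomposes into `e` plus a combining accent that the filter removes.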
With the filtering used here, you might render some strings unreadable. For example, a string of Chinese characters would be filtered away completely because none of them have an ASCII representation (this is more like iconv's //IGNORE).
Overall, it would be safer to build your own lookup table of valid character substitutions, or at least of combining characters (accents and things) that are safe to strip. The best solution depends on the range of input characters you expect to handle.
One solution is to execute iconv as an external process. It will certainly offend purists. It depends on the presence of iconv on the system, but it works and does exactly what you want:
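A rough sketch of calling iconv as a child process could look like the following. It assumes an `iconv` binary on the `PATH` and Java 9 or later (for `InputStream.readAllBytes`); the class and method names are hypothetical. The exact transliterations depend on the installed iconv implementation and, on some systems, on the locale.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class IconvProcess {
    // Pipes the string through "iconv -f UTF-8 -t ASCII//TRANSLIT".
    // ASSUMPTION: an iconv executable is installed and reachable via PATH.
    static String translit(String input) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("iconv", "-f", "UTF-8", "-t", "ASCII//TRANSLIT")
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        // Write stdin on a separate thread so large inputs cannot block
        // while we are reading stdout.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = p.getOutputStream()) {
                stdin.write(bytes);
            } catch (IOException ignored) {
                // iconv exited early; the exit code check below reports it
            }
        });
        writer.start();
        byte[] out = p.getInputStream().readAllBytes();
        writer.join();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("iconv exited with status " + exit);
        }
        return new String(out, StandardCharsets.US_ASCII);
    }
}
```

Starting a process per string is expensive, so this is mostly reasonable for occasional or batch conversions.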
Let's start with a slight variation of Ericson's answer and build more //TRANSLIT features on it:
Decompose chars to gain an ASCII String
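A minimal sketch of such a function, assuming the encodability check is done with `CharsetEncoder.canEncode` and that iteration is per code point. The names `AsciiDecompose` and `toAscii` are illustrative only.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class AsciiDecompose {
    // Decomposes the input (NFKD) and keeps only code points that the
    // US-ASCII encoder can encode. Swapping the encoder changes the target.
    static String toAscii(String input) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        char[] decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD).toCharArray();
        StringBuilder sb = new StringBuilder(decomposed.length);
        for (int i = 0; i < decomposed.length; ) {
            // Step by code point so surrogate pairs are handled as one unit.
            int codePoint = Character.codePointAt(decomposed, i);
            int charCount = Character.charCount(codePoint);
            if (encoder.canEncode(CharBuffer.wrap(decomposed, i, charCount))) {
                sb.append(decomposed, i, charCount);
            }
            i += charCount;
        }
        return sb.toString();
    }
}
```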
While this should behave the same for US-ASCII, this solution is easier to adapt to different target encodings. (As characters are decomposed first, this does not necessarily yield better results for other encodings, though.)
The function is safe for supplementary code points (which is a bit overkill for ASCII as the target, but may reduce headaches if another target encoding is chosen).
Also note that a regular Java String is returned; if you need an ASCII byte[], you still need to convert it (but as we ensured there are no offending characters...).
And this is how you could extend it to more character sets:
Replace or decompose characters to gain a String encodable in the supplied Charset
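One way this might look, as a sketch: keep code points the target charset can encode, otherwise try a replacement table, and only then fall back to NFKD decomposition. The two-entry `REPLACEMENTS` table and all names here are hypothetical placeholders; a real table would be far larger.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class CharsetTranslit {
    // Hypothetical, deliberately tiny replacement table keyed by code point.
    static final Map<Integer, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put(0x20AC, "EUR"); // euro sign
        REPLACEMENTS.put(0x00DF, "ss");  // sharp s
    }

    static String toCharset(String input, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                sb.append(s); // already representable, keep as-is
                continue;
            }
            String replacement = REPLACEMENTS.get(cp);
            // Decompose only on demand, when no replacement is known.
            String candidate = replacement != null
                    ? replacement
                    : Normalizer.normalize(s, Normalizer.Form.NFKD);
            appendEncodable(sb, candidate, encoder);
        }
        return sb.toString();
    }

    // Appends only those code points of text that the encoder accepts.
    private static void appendEncodable(StringBuilder sb, String text, CharsetEncoder encoder) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                sb.append(s);
            }
        }
    }
}
```

With `StandardCharsets.ISO_8859_1` as the target, an accented `é` is kept unchanged because Latin-1 can encode it, while the euro sign is replaced from the table.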
I would strongly recommend building an extensive replacement table, as the simple example already shows how you otherwise might lose desired information like €. For ASCII this implementation is of course a bit slower, as decomposition is only done on demand and the StringBuilder now may need to grow to hold the replacements.
GNU's iconv uses the replacements listed in translit.def to perform a //TRANSLIT conversion, and you can use a method like this if you want to use it as a replacement map:
Import original //TRANSLIT replacements
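A sketch of such an import, under a loudly stated assumption about the file layout: each data line is taken to be a hexadecimal code point, a tab, the replacement text, and optionally a tab followed by a `#` comment. Check this against the actual translit.def before relying on it. The class name and the choice to read from a `Reader` (so the file can come from disk or a download) are illustrative.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

class TranslitDefReader {
    // ASSUMPTION: data lines look like "<hex code point>\t<replacement>[\t# comment]".
    // Empty lines, lines starting with '#', and malformed lines are skipped.
    static Map<Integer, String> read(Reader source) {
        Map<Integer, String> map = new HashMap<>();
        new BufferedReader(source).lines().forEach(line -> {
            if (line.isEmpty() || line.startsWith("#")) {
                return;
            }
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                return;
            }
            try {
                int codePoint = Integer.parseInt(fields[0].trim(), 16);
                // Keep the first replacement seen for a code point.
                map.putIfAbsent(codePoint, fields[1]);
            } catch (NumberFormatException e) {
                // not a hex code point, ignore the line
            }
        });
        return map;
    }
}
```

The resulting map is keyed by code point, so it could be used as the replacement table consulted before falling back to decomposition.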