Java: Detecting non-displayable characters for a given character encoding
I'm currently working on an application to validate and parse CSV files.
The CSV files have to be encoded in UTF-8, although sometimes we get files in the wrong encoding.
The CSV files most likely contain special characters of the German alphabet (Ä, Ö, Ü, ß), as most of the text in them is in German.
For the validator, I need to make sure the file is UTF-8 encoded. As long as no special characters are present, parsing is unlikely to be a problem.
What I have tried so far is to read the file as bytes and use some libraries to detect (or guess) the encoding. I tried most of the options in this blog post: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html
But none of the libraries I tried returned the correct encoding, so I couldn't parse the special characters.
Now to my question:
Given a character encoding like UTF-8, is there a way to detect characters that are not encoded correctly? Basically, the characters that are displayed in the (Eclipse) console as question marks.
Or is there any other way to correctly determine the character encoding?
I just need to know if it's UTF-8 or not.
Thank you all in advance for your help! :)
Best Regards,
Robert
Comments (2)
Byte sequences that cannot be decoded correctly are replaced with the "replacement character", \uFFFD, which is displayed like this: �. However, if the output device doesn't support that character, it is likely to show a question mark (?) instead.
So, after decoding the UTF-8 data into String objects, search for occurrences of \uFFFD.
Alternatively, if you set up an InputStreamReader with an instance of CharsetDecoder that you create yourself, you get a lot more control. For example, you can specify that an Exception should be raised if any byte sequence cannot be decoded. Or you can ignore such sequences. Or you can specify a different character as the replacement character.
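The two approaches from the first comment can be sketched roughly as follows. This is only an illustration, assuming the whole file has already been read into a byte array; the class and method names (Utf8Validator, isValidUtf8, containsReplacementChar) are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class Utf8Validator {

    // Strict check: a decoder configured with REPORT throws on the first
    // byte sequence that is not valid UTF-8 instead of replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient check: new String(bytes, charset) substitutes U+FFFD for
    // undecodable bytes, so we decode first and then search for it.
    static boolean containsReplacementChar(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        return text.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // "Ä, Ö, Ü, ß" written with escapes so the source file encoding does not matter.
        String umlauts = "\u00C4, \u00D6, \u00DC, \u00DF";
        byte[] utf8 = umlauts.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = umlauts.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(utf8));               // true
        System.out.println(isValidUtf8(latin1));             // false
        System.out.println(containsReplacementChar(latin1)); // true
    }
}
```

The strict variant is probably the better fit for a validator: the replacement-character search would also flag a file that legitimately contains U+FFFD. The same configured decoder can be passed to the InputStreamReader(InputStream, CharsetDecoder) constructor if the file should be streamed rather than loaded at once.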
If the text is German and the encoding isn't UTF-8, it's probably windows-1252. Or something close to windows-1252, like ISO-8859-15 (the German letters Ä, Ö, Ü and ß are encoded identically in both). That being the case, Laforge's GuessEncoding should be all you need. I've used it quite a bit and never had a problem, and that's working almost exclusively with English text; German should be even easier to detect.
I see he still hasn't specified a license on his blog or in the source files, but I know those classes are used in Groovy, so that shouldn't be a problem.
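If adding a detection library is not an option, the same idea can be approximated with the JDK alone. The sketch below does not use GuessEncoding or its API; it is a simpler stand-in that tries strict UTF-8 first and otherwise assumes windows-1252, which is only a guess based on the texts being German. The class name GermanCsvDecoder is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of GuessEncoding or any other library.
final class GermanCsvDecoder {

    // Assumption: anything that is not valid UTF-8 is windows-1252.
    // This charset ships with common JDKs but is not one of the
    // charsets every Java platform is required to support.
    private static final Charset FALLBACK = Charset.forName("windows-1252");

    static String decode(byte[] bytes) {
        try {
            // Strict UTF-8 decoding: malformed input raises an exception.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not UTF-8, so fall back to the assumed legacy encoding.
            return new String(bytes, FALLBACK);
        }
    }
}
```

This works reasonably well in practice because German text in a single-byte encoding almost never happens to form valid UTF-8 multi-byte sequences, but it cannot tell windows-1252 apart from other single-byte encodings.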