How can I detect illegal UTF-8 byte sequences in a Java InputStream and replace them?
The file in question is not under my control. Most byte sequences are valid UTF-8; it is not ISO-8859-1 (or any other encoding).
I want to do my best to extract as much information as possible.
The file contains a few illegal byte sequences, which should be replaced with the replacement character.
It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.
Oracle has a wrapper which does what I need:
UTF8ValidationFilter javadoc
Is there something like that available (commercially or as free software)?
Thanks
-stephan
Solution:
final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);
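To see the effect of the decoder configuration above, here is a minimal, self-contained sketch. The class name LenientUtf8Demo, the helper readLenient, and the three-byte sample input are assumptions made for illustration; only the decoder setup comes from the solution itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original solution.
class LenientUtf8Demo {

    // Reads the whole stream as UTF-8, substituting U+FFFD for malformed input.
    static String readLenient(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'a', a stray 0xFF byte (never valid in UTF-8), then 'b'.
        byte[] bytes = {'a', (byte) 0xFF, 'b'};
        String text = readLenient(new ByteArrayInputStream(bytes));
        // The stray byte should come back as the replacement character.
        System.out.println(text.equals("a\uFFFDb"));
    }
}
```

The valid bytes on either side of the bad one survive untouched, which is the "extract as much as possible" behaviour the question asks for.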
Comments (3)
java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()). CharsetDecoder itself decodes a ByteBuffer into a CharBuffer rather than writing to a stream, so wrap it in an InputStreamReader to read characters; if you need bytes again, re-encode the characters into a java.io.PipedOutputStream connected to a PipedInputStream, effectively creating a filtered InputStream.
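For the CharsetDecoder-plus-pipe idea, one possible shape is sketched below. It is only an illustration under assumptions: the class name Utf8SanitizingStream, the method sanitize, the buffer size, and the use of a daemon pump thread are all choices made here, not something the answer specifies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: returns a stream of well-formed UTF-8 bytes.
class Utf8SanitizingStream {

    static InputStream sanitize(InputStream raw) throws IOException {
        PipedInputStream cleaned = new PipedInputStream();
        PipedOutputStream sink = new PipedOutputStream(cleaned);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Piped streams need a separate thread on the writing side.
        Thread pump = new Thread(() -> {
            // Closing the writer closes the pipe, which signals EOF to the reader.
            try (Reader reader = new InputStreamReader(raw, decoder);
                 Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
                char[] buf = new char[4096];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    writer.write(buf, 0, n);
                }
            } catch (IOException e) {
                // Sketch only: an error just ends the cleaned stream early.
                // Real code would want to surface this to the consumer.
            }
        }, "utf8-sanitizer");
        pump.setDaemon(true);
        pump.start();
        return cleaned;
    }
}
```

If the consumer only needs characters, the extra thread is unnecessary: reading directly from an InputStreamReader built on the configured decoder is simpler.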
One way would be to read the first few bytes and check for a byte order mark (BOM), if one exists. More information on the BOM: http://en.wikipedia.org/wiki/Byte_order_mark (the page includes a table of the BOM bytes). One problem, however, is that UTF-8 does not require a BOM at the start of the file. Another way to solve the problem is pattern recognition (reading a few bytes, 8 bits each, at a time). Either way, this is the complicated solution.
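The BOM check described here can be sketched as follows. The class name Utf8BomSniffer and method skipUtf8Bom are hypothetical; the byte values EF BB BF are the UTF-8 BOM. Note that this only detects and skips a leading BOM; it says nothing about whether the rest of the stream is well-formed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper for illustration only.
class Utf8BomSniffer {

    // Returns a stream positioned just after a UTF-8 BOM if one is present,
    // otherwise positioned at the very first byte.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; a single read() call may return fewer.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // No BOM: put the bytes back so the caller sees them.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```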
The behaviour you want is already the default for InputStreamReader, so there is no need to specify it yourself. This suffices:
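A minimal sketch of relying on that default, assuming the two-argument InputStreamReader(InputStream, Charset) constructor. The class name DefaultReplaceDemo and the helper readWithDefaults are hypothetical. The replacement behaviour is what the OpenJDK implementation does when given a Charset or charset name; it is worth confirming on your own runtime.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
class DefaultReplaceDemo {

    // No explicit CharsetDecoder: the reader is built from the Charset alone.
    static String readWithDefaults(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bytes = {'a', (byte) 0xFF, 'b'};
        String text = readWithDefaults(new ByteArrayInputStream(bytes));
        System.out.println(text.equals("a\uFFFDb"));
    }
}
```

Configuring the decoder explicitly, as in the question's solution, makes the intent visible and is the way to choose a different action such as CodingErrorAction.REPORT.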