How to detect illegal UTF-8 byte sequences in order to replace them in a Java input stream?

Posted on 2024-09-25 04:56:58 · 785 characters · 13 views · 0 comments

The file in question is not under my control. Most byte sequences are valid UTF-8; it is not ISO-8859-1 (or another encoding).
I want to do my best to extract as much information as possible.

The file contains a few illegal byte sequences; those should be replaced with the replacement character.

It's not an easy task; I think it requires some knowledge about the UTF-8 state machine.

Oracle has a wrapper which does what I need:
UTF8ValidationFilter javadoc

Is there something like that available (commercially or as free software)?

Thanks
-stephan

Solution:

final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);
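
To see what the snippet above does to a bad byte, here is a minimal, self-contained sketch. The class name `Utf8ReplaceDemo` and the helper `decodeLenient` are made up for illustration; only the `CharsetDecoder` and `InputStreamReader` calls come from the solution itself. The sample input is an assumption as well: `0xFF` is used because it can never appear in well-formed UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not part of any library.
class Utf8ReplaceDemo {

    // Reads the whole stream as UTF-8, substituting U+FFFD for malformed input.
    static String decodeLenient(InputStream istream) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(istream, decoder)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'a', an illegal byte, 'b'
        byte[] data = {0x61, (byte) 0xFF, 0x62};
        String text = decodeLenient(new ByteArrayInputStream(data));
        // The illegal byte becomes the replacement character U+FFFD.
        System.out.println(text.equals("a\uFFFDb"));
    }
}
```

The valid bytes around the bad one survive untouched, which is the "extract as much as possible" behaviour the question asks for.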


Comments (3)

萤火眠眠 2024-10-02 04:56:58

java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).

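Note that CharsetDecoder itself decodes a ByteBuffer into a CharBuffer; it does not write to a stream. To get a filtered InputStream, you can re-encode the decoded characters into an OutputStream and pipe that into an InputStream using java.io.PipedOutputStream and java.io.PipedInputStream.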
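
One possible reading of the piping idea, as a rough sketch rather than a finished filter: a background thread decodes the raw bytes with replacement, re-encodes the characters as UTF-8, and writes them into a `PipedOutputStream`; the caller reads clean bytes from the connected `PipedInputStream`. The class `Utf8SanitizingPipe` and the method `sanitize` are hypothetical names, and the use of a daemon thread is an assumption made to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
class Utf8SanitizingPipe {

    // Returns a stream of well-formed UTF-8; bad input bytes become U+FFFD.
    static InputStream sanitize(InputStream raw) throws IOException {
        PipedInputStream clean = new PipedInputStream();
        PipedOutputStream sink = new PipedOutputStream(clean);
        Thread worker = new Thread(() -> {
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            // Closing the writer also closes the pipe, so the reader sees EOF.
            try (Reader in = new InputStreamReader(raw, decoder);
                 Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
                char[] buf = new char[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } catch (IOException e) {
                // Sketch only: an I/O failure just ends the clean stream early.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return clean;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x61, (byte) 0xFF, 0x62};
        InputStream clean = sanitize(new ByteArrayInputStream(data));
        ByteArrayOutputStream copy = new ByteArrayOutputStream();
        int b;
        while ((b = clean.read()) != -1) {
            copy.write(b);
        }
        System.out.println(new String(copy.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

If the consumer is happy with characters rather than bytes, the extra thread and the re-encoding step are unnecessary, and wrapping the stream in an `InputStreamReader` is simpler.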

唔猫 2024-10-02 04:56:58

One way would be to read the first few bytes and check for a byte order mark (BOM), if one exists. More information on the BOM: http://en.wikipedia.org/wiki/Byte_order_mark. At that URL you will find a table of the BOM bytes. One problem, however, is that UTF-8 does not require a BOM at the start of the data. Another way to solve the problem is pattern recognition (reading a few bytes, 8 bits each, at a time). Either way, this is the complicated solution.
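
The BOM check described above could look roughly like the sketch below. It peeks at the first three bytes through a `PushbackInputStream`, drops them if they are the UTF-8 BOM (`EF BB BF`), and pushes them back otherwise. The class `BomSniffer` and the method `skipUtf8Bom` are hypothetical names. Note that this only tells you whether a BOM is present; it does not validate the rest of the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper; names are illustrative only.
class BomSniffer {

    // Returns a stream positioned after a leading UTF-8 BOM, if there is one.
    static InputStream skipUtf8Bom(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 or EOF.
        while (n < 3) {
            int r = in.read(head, n, 3 - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: put the bytes back so the caller sees them again.
            in.unread(head, 0, n);
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println((char) in.read());
    }
}
```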

回眸一遍 2024-10-02 04:56:58

The behaviour you want is already the default for InputStreamReader. So there is no need to specify it yourself. This suffices:

final BufferedInputStream in = new BufferedInputStream(istream);
final Reader inputReader = new InputStreamReader(in, StandardCharsets.UTF_8);
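
A small sketch contrasting the two modes, under the assumption of an OpenJDK-based runtime: reading through a plain `InputStreamReader` replaces the bad byte, while a decoder set to `CodingErrorAction.REPORT` throws a `CharacterCodingException`, which is one way to detect illegal sequences instead of replacing them. The class `Utf8StrictCheck` and its two methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class Utf8StrictCheck {

    // Detection: a REPORT decoder throws on the first illegal sequence.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Replacement: InputStreamReader with a Charset and no explicit decoder.
    static String readWithDefaults(byte[] data) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x61, (byte) 0xFF, 0x62};
        System.out.println(isValidUtf8(data));
        System.out.println(readWithDefaults(data).equals("a\uFFFDb"));
    }
}
```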