
How can I detect illegal UTF-8 byte sequences and replace them in a Java input stream?

Posted on 2024-09-25 04:56:58

The file in question is not under my control. Most of its byte sequences are valid UTF-8; it is not ISO-8859-1 (or another encoding). I want to do my best to extract as much information as possible.

The file contains a few illegal byte sequences, which should be replaced with the replacement character.

It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.

Oracle has a wrapper which does what I need:
UTF8ValidationFilter javadoc

Is there something like that available (commercially or as free software)?

Thanks
-stephan

Solution:

// Needs java.io.* and java.nio.charset.*; istream is the raw input stream.
// Malformed input is replaced with U+FFFD instead of raising an error.
final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);
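
To see what the solution does with a bad byte, here is a minimal, self-contained sketch. The class name ReplacingReaderDemo, the helper readLenient, and the three-byte test input are assumptions made up for illustration; only the decoder setup comes from the post.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class ReplacingReaderDemo {

    // Reads the whole stream as UTF-8, substituting U+FFFD for malformed input.
    static String readLenient(InputStream istream) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new BufferedInputStream(istream), decoder)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'a', then a stray 0xFF byte (never valid in UTF-8), then 'b'.
        byte[] bytes = {'a', (byte) 0xFF, 'b'};
        String text = readLenient(new ByteArrayInputStream(bytes));
        System.out.println(text.equals("a\uFFFDb")); // prints true
    }
}
```

The valid bytes on either side of the bad one survive untouched, which is the "extract as much as possible" behaviour the question asks for.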


萤火眠眠 2024-10-02 04:56:58

java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions for different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).

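CharsetDecoder itself works on buffers (ByteBuffer in, CharBuffer out) rather than writing to an OutputStream. To get a filtered InputStream, you can re-encode the decoded characters and write them to a java.io.PipedOutputStream connected to a PipedInputStream.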
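
As a rough sketch of the "filtered InputStream" idea, the snippet below decodes with replacement and re-encodes the result as clean UTF-8. It is an assumption-laden simplification: the class Utf8Sanitizer and method sanitize are hypothetical names, and it buffers the whole input in memory instead of using piped streams, so it only suits inputs that fit in memory. It also relies on InputStream.readAllBytes(), which needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing library class.
class Utf8Sanitizer {

    // Returns a stream of well-formed UTF-8 in which every malformed
    // sequence of the raw input has been replaced by U+FFFD (EF BF BD).
    static InputStream sanitize(InputStream raw) throws IOException {
        byte[] all = raw.readAllBytes(); // Java 9+; buffers everything in memory
        CharBuffer chars = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .decode(ByteBuffer.wrap(all));
        ByteBuffer clean = StandardCharsets.UTF_8.encode(chars);
        byte[] out = new byte[clean.remaining()];
        clean.get(out); // copy only the encoded bytes
        return new ByteArrayInputStream(out);
    }

    public static void main(String[] args) throws IOException {
        byte[] dirty = {'a', (byte) 0xFF, 'b'};
        byte[] result = sanitize(new ByteArrayInputStream(dirty)).readAllBytes();
        System.out.println(result.length); // 'a' + 3-byte U+FFFD + 'b'
    }
}
```

For large files a streaming variant (piped streams on a second thread, or a Reader used directly) would avoid holding the entire content in memory.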


唔猫 2024-10-02 04:56:58

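One approach is to read the first few bytes and check for a byte order mark (BOM), if one is present. More about the BOM, including a table of BOM bytes, is at http://en.wikipedia.org/wiki/Byte_order_mark. One problem, however, is that UTF-8 does not require a BOM at the start of the file. Another way to solve the problem is pattern recognition (reading a few bytes, 8 bits each, at a time). Either way, this is a complicated solution.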
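
The BOM check described above could be sketched as follows. The class BomSniffer and method skipUtf8Bom are hypothetical names; the code uses InputStream.readNBytes (Java 9+) and a PushbackInputStream so that non-BOM bytes can be put back. Note that a BOM only hints at the encoding; it says nothing about whether the rest of the file is well-formed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical helper for illustration only.
class BomSniffer {

    // Consumes a leading UTF-8 BOM (EF BB BF) if present and reports whether
    // one was found. The stream needs a pushback buffer of at least 3 bytes.
    static boolean skipUtf8Bom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int n = in.readNBytes(head, 0, 3); // may be < 3 for very short input
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            in.unread(head, 0, n); // not a BOM: put the bytes back
        }
        return bom;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'x'};
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(withBom), 3);
        System.out.println(skipUtf8Bom(in));     // true
        System.out.println((char) in.read());    // x
    }
}
```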


回眸一遍 2024-10-02 04:56:58

The behaviour you want is already the default for InputStreamReader, so there is no need to specify it yourself. This suffices:

// A Reader constructed from a Charset replaces malformed input with U+FFFD.
final BufferedInputStream in = new BufferedInputStream(istream);
final Reader inputReader = new InputStreamReader(in, StandardCharsets.UTF_8);
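
The sketch below contrasts that default replacing behaviour with strict detection, in case you want to know whether a file contained bad bytes at all. The class Utf8Check and both method names are hypothetical; the strict variant assumes the data fits in a byte array and uses CodingErrorAction.REPORT so the decoder throws on the first malformed sequence.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class Utf8Check {

    // Default behaviour: a Reader built from a Charset silently replaces bad input.
    static String readWithDefaults(byte[] bytes) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Strict detection: REPORT makes decode() throw on malformed input.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] dirty = {'a', (byte) 0xFF, 'b'};
        System.out.println(isValidUtf8(dirty));                          // false
        System.out.println(readWithDefaults(dirty).equals("a\uFFFDb"));  // true
    }
}
```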


About the author: 我纯我任性
