An alternative to Java regular expressions for bytes in a stream

Posted on 2024-11-05 03:59:18

I have XML files (encoded in UTF-8) that have two issues:

  • Some of them (not all) contain a byte order mark (EF BB BF).

  • Some of them (not all) contain null characters (00), distributed over the whole file.

Both issues prevent me from parsing the XML with a SAX parser. My current approach is to read the file into a String, use regular expressions to strip these characters, and write the string back to a file, which works fine.
However, my files are quite large (hundreds of megabytes), and reading a file into a String and then creating a result String of the same size on every call to replaceAll() quickly leads to a Java heap space error.

Increasing the heap size is definitely not a long-term solution. I need to stream the file and strip all these characters on the fly.

Any suggestions on what an efficient solution should look like?

Comments (3)

迷雾森÷林ヴ 2024-11-12 03:59:21

I concentrated only on the BOM and noticed the issue with the null bytes too late. I am still posting this as an addition in case someone has a problem with BOMs only. Please be kind with respect to downvotes. :)


You could use an InputStream that supports mark() and reset(): mark the start, read the first three bytes, and reset if they are not a BOM:

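import java.io.*;
import java.util.Arrays;

InputStream in = new BufferedInputStream(
        new FileInputStream(new File("xmlfile.xml")));
// Remember the start of the stream; the mark stays valid for 3 bytes.
in.mark(3);
byte[] maybeBom = new byte[] {
        (byte) in.read(), (byte) in.read(), (byte) in.read() };

// Not a UTF-8 BOM: rewind so the parser still sees these bytes.
if (!Arrays.equals(maybeBom, new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF })) {
    in.reset();
}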

I use BufferedInputStream because FileInputStream does not support mark().

百合的盛世恋 2024-11-12 03:59:19

I would subclass FilterInputStream to filter out the undesired bytes at runtime.

The task should be rather easy, as a byte order mark will probably appear only at the start of the file (so you only need to check there), and null bytes can easily be filtered with a simple == comparison (no need for regex-like features).

This will most likely also increase performance as you don't need to write out the full corrected file to disk before re-reading it.
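A minimal sketch of such a subclass, under stated assumptions: the class name BomAndNullStrippingInputStream and the demo class are made up for illustration, the BOM check uses a PushbackInputStream so that non-BOM bytes can be put back, and the demo relies on InputStream.readAllBytes(), which needs Java 9 or later. It is one possible shape, not necessarily what the answer had in mind.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: drops a leading UTF-8 BOM and every 0x00 byte. */
class BomAndNullStrippingInputStream extends FilterInputStream {

    BomAndNullStrippingInputStream(InputStream source) throws IOException {
        super(skipBom(source));
    }

    /** Reads up to three bytes and pushes them back unless they are EF BB BF. */
    private static InputStream skipBom(InputStream source) throws IOException {
        PushbackInputStream in = new PushbackInputStream(source, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = in.read(head, n, 3 - n);
            if (r < 0) {
                break; // stream is shorter than three bytes
            }
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            in.unread(head, 0, n);
        }
        return in;
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (b == 0); // skip null bytes; -1 (end of stream) ends the loop
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n; // -1 at end of stream
            }
            // Compact the chunk in place, keeping only non-null bytes.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (b != 0) {
                    buf[off + kept++] = b;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The chunk held only null bytes: read again instead of returning 0.
        }
    }
}

class BomAndNullStrippingDemo {
    public static void main(String[] args) throws IOException {
        byte[] raw = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', 0, '/', '>', 0 };
        try (InputStream in = new BomAndNullStrippingInputStream(new ByteArrayInputStream(raw))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // prints <a/>
        }
    }
}
```

The sketch leaves skip() and available() untouched, so their results still count the dropped bytes. For a file, wrapping the FileInputStream in a BufferedInputStream before passing it in keeps single-byte reads cheap.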

无语# 2024-11-12 03:59:19

Why don't you filter the data as you read it into the SAX parser? That way you won't need to rewrite the file. You can override the read() methods of FilterInputStream to drop the bytes you don't want.

I think that is what @Joachim is suggesting. ;)
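As a rough illustration of feeding a filtered stream straight into the parser, the sketch below wraps the input in an anonymous FilterInputStream that drops null bytes and hands it to SAXParser.parse(). The names FilteredSaxExample, dropNulls and countElements are invented for this example, the handler merely counts elements, and BOM handling is left out to keep it short.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class FilteredSaxExample {

    /** Hypothetical helper: wraps a stream so 0x00 bytes never reach the parser. */
    static InputStream dropNulls(InputStream source) {
        return new FilterInputStream(source) {
            @Override
            public int read() throws IOException {
                int b;
                do {
                    b = super.read();
                } while (b == 0);
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                if (len == 0) {
                    return 0;
                }
                while (true) {
                    int n = super.read(buf, off, len);
                    if (n <= 0) {
                        return n; // -1 at end of stream
                    }
                    int kept = 0;
                    for (int i = 0; i < n; i++) {
                        if (buf[off + i] != 0) {
                            buf[off + kept++] = buf[off + i];
                        }
                    }
                    if (kept > 0) {
                        return kept;
                    }
                }
            }
        };
    }

    /** Parses the filtered stream with SAX and counts the start tags seen. */
    static int countElements(InputStream xml) throws Exception {
        int[] count = { 0 };
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                count[0]++;
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(dropNulls(xml), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<root>\0<item/>\0<item/></root>".getBytes(StandardCharsets.UTF_8);
        System.out.println(countElements(new ByteArrayInputStream(xml))); // prints 3
    }
}
```

With a real file, the ByteArrayInputStream would be replaced by a BufferedInputStream around a FileInputStream, so the document is parsed while it is read and never held in memory as a String.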
