I have XML files (encoded in UTF-8) that have two issues:
- Some of them (not all) contain a byte order mark (EF BB BF).
- Some of them (not all) contain null characters (00), distributed over the whole file.
Both issues prevent me from parsing the XML with a SAX parser. My current approach was to read the file into a String, use regex to strip out these characters, and write the string back to a file, which worked fine.

However, my files are quite large (hundreds of megabytes), and reading the file into a String and creating a result String of the same size every time I call replaceAll() quickly leads to a Java heap space error.

Increasing the heap size is definitely not a long-term solution. I will need to stream the file and strip all these characters on the fly.

Any suggestions on what an efficient solution should look like?
I only concentrated on the BOM and saw the issue with the null bytes too late. I still post this as an addition in case someone has a problem with BOMs only. Please be kind with respect to downvotes. :)

You could read the first three bytes with an InputStream that supports mark() and reset(), and reset if they were not a BOM. I use BufferedInputStream because FileInputStream does not support mark().
I would subclass FilterInputStream to filter out the undesired bytes at runtime.

The task should be rather easy, as byte order marks probably occur only at the start of the file (so you only need to check there), and null bytes can easily be filtered with a simple == comparison (no need for regex-like features).

This will most likely also increase performance, since you don't need to write the full corrected file to disk before re-reading it.
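A sketch of such a subclass, handling just the null bytes (the class name is illustrative; the answer itself does not provide code):

```java
import java.io.*;

// Drops every 0x00 byte from the wrapped stream, so a SAX parser
// reading from this stream never sees them.
public class NullByteFilterInputStream extends FilterInputStream {

    public NullByteFilterInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (b == 0x00);              // skip null bytes; -1 (EOF) passes through
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n <= 0) {
            return n;
        }
        // Compact the buffer in place, dropping 0x00 bytes.
        int kept = off;
        for (int i = off; i < off + n; i++) {
            if (buf[i] != 0x00) {
                buf[kept++] = buf[i];
            }
        }
        int result = kept - off;
        // If everything read was filtered out, try again rather than
        // reporting a bogus zero-length read.
        return result == 0 ? read(buf, off, len) : result;
    }
}
```

Note that the bulk read(byte[], int, int) must be overridden as well: FilterInputStream delegates it straight to the wrapped stream, so overriding only read() would let null bytes through on bulk reads.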
Why don't you filter the data as you read it into the SAX parser? This way you won't need to re-write the file. You can override the read() methods of FilterInputStream to drop the bytes you don't want.
I think that is what @Joachim is suggesting. ;)
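A minimal end-to-end sketch of that suggestion, assuming a SAX parse of a file containing stray null bytes (class and method names are my own). The bulk read is implemented naively, byte by byte, because FilterInputStream delegates read(byte[], int, int) directly to the wrapped stream, which would otherwise bypass the filtering:

```java
import java.io.*;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class FilteredParse {

    // Returns a view of the raw stream with all 0x00 bytes removed.
    public static InputStream withoutNulls(InputStream raw) {
        return new FilterInputStream(new BufferedInputStream(raw)) {
            @Override
            public int read() throws IOException {
                int b;
                do { b = super.read(); } while (b == 0x00);
                return b;
            }

            // Must be overridden: the FilterInputStream default delegates
            // bulk reads to the wrapped stream, skipping read() above.
            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                int i = 0;
                while (i < len) {
                    int b = read();
                    if (b == -1) break;
                    buf[off + i++] = (byte) b;
                }
                return (i == 0 && len > 0) ? -1 : i;
            }
        };
    }

    // Feed the filtered stream straight into the SAX parser.
    public static void parse(InputStream xml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(withoutNulls(xml)), handler);
    }
}
```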