Parsing a fixed-format text file in Java
Suppose I know the format of a text file: say, each line contains 4 fields, like this:

    firstword secondword thirdword fourthword
    firstword2 secondword2 thirdword2 fourthword2
    ...

and I need to read it fully into memory. I can use this approach:

    open a text file
    while not EOF
        read line by line
        split each line by a space
        create a new object with four fields extracted from each line
        add this object to a Set

OK, but is there anything better, such as a special third-party Java library, so that we could define the structure of each text line beforehand and parse the file with some function, like this?

    thirdpartylib.setInputTextFileFormat("format.xml");
    thirdpartylib.parse(Set, "pathToFile")
Comments (2)
If you know definitively what the separator will be, then your suggested approach will be fast and reliable and have very little code overhead. The upside of a third-party library (search for "java text file library" for a long list) is that it is likely to have a bunch of code to handle odd cases that its authors care about. The downside is that it will probably be more code than you need if the text file format you are handling is simple and reliable.
The upside of doing this yourself is that you can tune the code to exactly your requirements, including scalability, which may well be a consideration if you have a lot of data. Quite often, third-party libraries read the whole file into memory, which may not be practical if you have, say, several million rows.
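As a rough illustration of the scalability point, a hand-written reader can hand each row to a callback instead of collecting everything first. This is only a sketch: `StreamingReader` and `forEachRow` are hypothetical names, and it assumes fields are separated by single spaces.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.function.Consumer;

final class StreamingReader {
    /**
     * Passes the fields of each line to the handler, one line at a time,
     * so memory use does not grow with the number of rows.
     * Returns the number of lines processed. The caller closes the reader.
     */
    static long forEachRow(Reader source, Consumer<String[]> handler) {
        long[] count = {0};
        // BufferedReader.lines() is lazy and wraps any IOException
        // in an UncheckedIOException.
        new BufferedReader(source).lines().forEach(line -> {
            handler.accept(line.split(" ")); // assumption: single-space separator
            count[0]++;
        });
        return count[0];
    }
}
```

For a real file, the reader could come from `Files.newBufferedReader(Paths.get("pathToFile"))` inside a try-with-resources block.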
My recommendation would be to spend an hour or so writing your own and see where you get. You may crack it with very little effort. If it turns out you have a complex problem to solve with different special issues around data format, then start looking for a library.
You can do it like this:
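A minimal sketch of such a loop, following the steps listed in the question. It is an illustration rather than the answerer's own code: `Row` and `SimpleLineParser` are hypothetical names, and the single-space separator is taken from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical value class holding the four fields of one line. */
final class Row {
    final String first, second, third, fourth;

    Row(String first, String second, String third, String fourth) {
        this.first = first;
        this.second = second;
        this.third = third;
        this.fourth = fourth;
    }

    // equals/hashCode let a Set treat identical lines as duplicates.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Row)) return false;
        Row r = (Row) o;
        return first.equals(r.first) && second.equals(r.second)
                && third.equals(r.third) && fourth.equals(r.fourth);
    }

    @Override
    public int hashCode() {
        return Objects.hash(first, second, third, fourth);
    }
}

final class SimpleLineParser {
    /** Reads every line, splits it on single spaces, and collects the rows. The caller closes the reader. */
    static Set<Row> parse(Reader source) {
        Set<Row> rows = new LinkedHashSet<>();
        BufferedReader in = new BufferedReader(source);
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(" ");
                // No validation here: a line with fewer than four fields
                // throws ArrayIndexOutOfBoundsException.
                rows.add(new Row(f[0], f[1], f[2], f[3]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```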
But you really need to better define what you mean by 'better'. The above approach will not behave nicely with 'bad' input, but it will be pretty fast (how fast really depends on the implementation of the Set: if you're constantly resizing it, you may incur a performance penalty).
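The two caveats in the paragraph above could be addressed roughly as follows: check the field count on every line, and pre-size the `HashSet` when the number of rows is approximately known. This is a sketch under assumptions: `StrictLineParser` and the `expectedRows` parameter are hypothetical, and rows are stored as `List<String>` only to keep the example short.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StrictLineParser {
    /**
     * Parses whitespace-separated lines and rejects any line that does not
     * have exactly four fields. expectedRows must be non-negative.
     * The caller closes the reader.
     */
    static Set<List<String>> parse(Reader source, int expectedRows) {
        // Initial capacity chosen so that expectedRows entries fit under
        // HashSet's default 0.75 load factor without rehashing.
        Set<List<String>> rows = new HashSet<>((int) (expectedRows / 0.75f) + 1);
        int[] lineNo = {0};
        new BufferedReader(source).lines().forEach(line -> {
            lineNo[0]++;
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 4) {
                throw new IllegalArgumentException(
                        "line " + lineNo[0] + ": expected 4 fields, found " + fields.length);
            }
            rows.add(Arrays.asList(fields));
        });
        return rows;
    }
}
```

Whether to fail on the first bad line, as here, or to skip and log it is a design choice that depends on how much the input can be trusted.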
Using XML and defining a schema will allow you to validate the input before parsing and will probably streamline object creation, but you won't be able to just have four strings on each line (you'll need XML tags, etc.). See XMLBeans for an example of a third-party library.
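To give an idea of what schema validation before parsing looks like, here is a sketch that uses the JDK's built-in `javax.xml.validation` API rather than XMLBeans. The schema and the element names (`rows`, `row`, `first` to `fourth`) are invented for illustration, as is the `XmlRowValidator` class.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class XmlRowValidator {
    // Hypothetical schema: a <rows> root holding <row> elements,
    // each with exactly four string children in a fixed order.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='rows'><xs:complexType><xs:sequence>"
      + "<xs:element name='row' minOccurs='0' maxOccurs='unbounded'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='first' type='xs:string'/>"
      + "<xs:element name='second' type='xs:string'/>"
      + "<xs:element name='third' type='xs:string'/>"
      + "<xs:element name='fourth' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true if the document conforms to the schema above. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Malformed XML or a schema violation (a broken XSD would also land here).
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off mentioned in the answer is visible here: every row needs five pairs of tags instead of four space-separated words.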