How do I efficiently write an XML database file?
I want to build an XML file as a datastore. It should look something like this:
<datastore>
  <item>
    <subitem></subitem>
    ...
    <subitem></subitem>
  </item>
  ....
  <item>
    <subitem></subitem>
    ...
    <subitem></subitem>
  </item>
</datastore>
At runtime I may need to add items to the datastore. The number of items may be high, so I don't want to hold the whole document in memory and can't use DOM. I just want to write the part where a change occurs. Or does DOM support this?
I had a first look at StAX, but I am not sure if it does what I want.
Wouldn't it be best to remember a cursor position at the end of the file, right before the root element is closed? That is always the position where new items will be added. So if I remember that position and keep it up to date during changes, I could add a new item at the end without iterating through the whole file.
Maybe a second cursor could be used independently of the first one to iterate over the document just for reading purposes.
I can't see that StAX supports any of this, does it?
Isn't there a block-based API for files instead of a stream-based one? Aren't files and filesystems typical examples of block "devices"? And if there is such an API, would it help me with my problem?
Thanks in advance.
Comments (4)
Updating XML is basically impossible because there's no "cheap" way to insert data.
Appending XML is not so bad. All you need to do there is seek to the end of the file, then GO BACK over the "end tag" (</datastore> in this case), and then just start writing. This is a cheap operation all told, but none of the frameworks really support it, as they're mostly designed to work with well-formed, complete XML documents as a whole, not in pieces.
You could use a StAX-like thing, but in this case StAX isn't aware of the <datastore> tag; rather, it's only aware of the <item> tags and their child elements. Then you create items and keep writing them, over and over, to the same OutputStream that you have set up.
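A minimal sketch of that seek-and-append step, using the JDK's RandomAccessFile. The class name XmlAppender, the 256-byte search window, and the UTF-8 encoding are assumptions made for illustration; the sketch also assumes the file ends with </datastore> followed by at most a little whitespace.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: appends an <item> by overwriting the closing root tag.
class XmlAppender {
    private static final byte[] END_TAG =
            "</datastore>".getBytes(StandardCharsets.UTF_8);

    /** Writes itemXml just before the closing root tag, then restores the tag. */
    static void appendItem(String path, String itemXml) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            long pos = findEndTag(file);
            file.seek(pos);                                   // go back over the end tag
            file.write((itemXml + "\n").getBytes(StandardCharsets.UTF_8));
            file.write(END_TAG);                              // put the end tag back
            file.write('\n');
            file.setLength(file.getFilePointer());            // drop any stale tail
        }
    }

    // Searches only the last few bytes of the file (assumed window: 256 bytes).
    private static long findEndTag(RandomAccessFile file) throws IOException {
        int window = (int) Math.min(file.length(), 256);
        long start = file.length() - window;
        byte[] tail = new byte[window];
        file.seek(start);
        file.readFully(tail);
        for (int i = window - END_TAG.length; i >= 0; i--) {
            if (matchesAt(tail, i)) {
                return start + i;
            }
        }
        throw new IOException("closing </datastore> tag not found near end of file");
    }

    private static boolean matchesAt(byte[] buf, int at) {
        for (int j = 0; j < END_TAG.length; j++) {
            if (buf[at + j] != END_TAG[j]) {
                return false;
            }
        }
        return true;
    }
}
```

The cost depends on the size of the new item, not on the size of the file, which is the point of the approach. The item text is written verbatim, so the caller is responsible for it being well-formed.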
That's the best way to do this.
But if you need to delete or change data, then you get to rewrite stuff, or do hacks, such as marking elements as "inactive": hunting them down in the XML file, seeking to the 'active="Y"' attribute, and then changing the Y to N in place. It can be done, and it will be mostly efficient, but it's far and away outside what the normal XML processing frameworks let you do. If I were to do that, I'd read the entire file once, keep track of those entries, and note their locations within it, so that later I could seek to them and change them efficiently.
Then when you update something, you "inactivate" the old one and "append" the new one. Eventually you get to GC the file by rewriting it all and throwing out the old, "inactive" entries.
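The inactivation hack could look roughly like the sketch below. It is not an XML-aware implementation: it assumes the flag is always written exactly as active="Y", that this byte sequence never appears in text content, and that the encoding is ASCII-compatible. The class name ActiveFlagIndex is made up for the example.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: indexes active="Y" flags by byte offset and flips them in place.
class ActiveFlagIndex {
    private static final byte[] MARKER =
            "active=\"Y\"".getBytes(StandardCharsets.US_ASCII);

    /** One streaming pass: returns the byte offset of the Y in every active="Y". */
    static List<Long> scan(String path) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int matched = 0;   // number of MARKER bytes matched so far
            long pos = 0;      // offset of the byte just read
            int b;
            while ((b = in.read()) != -1) {
                if (b == MARKER[matched]) {
                    matched++;
                } else {
                    // 'a' occurs only at index 0 of MARKER, so this restart is sufficient
                    matched = (b == MARKER[0]) ? 1 : 0;
                }
                if (matched == MARKER.length) {
                    offsets.add(pos - 1);  // the Y sits just before the closing quote
                    matched = 0;
                }
                pos++;
            }
        }
        return offsets;
    }

    /** Overwrites a single byte: the Y at the given offset becomes N. */
    static void deactivate(String path, long flagOffset) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            file.seek(flagOffset);
            file.write('N');
        }
    }
}
```

Because Y and N have the same length, the overwrite never shifts the rest of the file, so the other recorded offsets stay valid.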
As a rule of thumb, XML files aren't very efficient as datastores, at least not for the record-based data you seem to want to use them for.
But if you've already got the file and absolutely can't do anything about it, you can use StAX XMLEventReaders and XMLEventWriters to read through a file quickly and insert/modify elements in it. But when I say quickly, I mean more quickly than DOM would be, and nowhere near as effective as any relational DB.
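As a rough illustration of the reader/writer pairing, the sketch below copies every event from input to output and injects one new <item> just before the closing </datastore>. The class and method names are invented for the example, and the item layout (a single <subitem> holding text) is an assumption.

```java
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams a datastore through StAX and adds one item at the end.
class StaxItemInserter {
    static void copyWithNewItem(InputStream in, OutputStream out, String subitemText)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
        XMLEventFactory events = XMLEventFactory.newInstance();

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            boolean closingRoot = event.isEndElement()
                    && "datastore".equals(event.asEndElement().getName().getLocalPart());
            if (closingRoot) {
                // Emit the new item before the root end tag is copied through.
                writer.add(events.createStartElement("", "", "item"));
                writer.add(events.createStartElement("", "", "subitem"));
                writer.add(events.createCharacters(subitemText));
                writer.add(events.createEndElement("", "", "subitem"));
                writer.add(events.createEndElement("", "", "item"));
            }
            writer.add(event);  // everything else is copied unchanged
        }
        writer.close();
        reader.close();
    }
}
```

Memory use stays flat because only one event is held at a time, but the whole file is still read and rewritten on every change, which is why this is faster than DOM yet far from database-like.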
Update: Another option you can consider is vtd-xml, although I haven't tried it in real projects, it actually looks pretty decent.
If you always want to append items at the end, then the best way to handle this is to have two XML files. The outer one, datastore.xml, is simply a wrapper, and looks like this:
The file items.xml looks like this:
with no wrapper element.
When you want to append data, you can open items.xml and write to the end of it. When you want to read data, open datastore.xml with an XML parser.
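The two file listings are missing above. One way to get the behaviour described, assumed here rather than taken from the answer, is to have datastore.xml pull in items.xml through an external parsed entity. The sketch below writes such a wrapper, appends to items.xml, and reads the combined document with the JDK's DOM parser. The entity name, class name, and method names are all illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical two-file store: datastore.xml wraps items.xml via an external entity.
class TwoFileStore {
    // Assumed wrapper content; the entity name "items" is arbitrary.
    static final String WRAPPER =
            "<!DOCTYPE datastore [\n"
          + "  <!ENTITY items SYSTEM \"items.xml\">\n"
          + "]>\n"
          + "<datastore>&items;</datastore>\n";

    /** Writes the wrapper once; it never changes afterwards. */
    static void init(Path dir) throws IOException {
        Files.write(dir.resolve("datastore.xml"), WRAPPER.getBytes(StandardCharsets.UTF_8));
    }

    /** Appends one item to items.xml, a plain sequence of <item> elements with no root. */
    static void append(Path dir, String itemXml) throws IOException {
        Files.write(dir.resolve("items.xml"),
                (itemXml + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Parses the wrapper; the parser resolves the entity relative to datastore.xml. */
    static int countItems(Path dir) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder()
                .parse(dir.resolve("datastore.xml").toFile());
        return doc.getElementsByTagName("item").getLength();
    }
}
```

This relies on the parser being allowed to load external entities. Parsers hardened against XXE attacks often disable that, in which case the wrapper parses but the items are not included or the parse fails, depending on the configuration.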
Of course, once your data grows beyond 20 MB or so, it may well be better to use an XML database. But I've been using this approach for years for records of Saxon orders, with files that are currently about 8 MB, and it works fine.
It's not very easy or efficient to partially update an XML file, so you won't find much support for it as a use case.
Really, it sounds like you need a proper database, perhaps with a tool to export the data as XML.
If you don't want to use a DB and insist on storing the data purely as XML, you might consider keeping all your items in memory as objects. Whenever a new one is added, you can write all of them out to XML. It might seem inefficient, but depending on your data size it might still be good enough.
If you choose this path, you might want to check out the XStream library to make this quite easy; see stream tutorial for a quick example.
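A small sketch of the keep-everything-in-memory approach follows. It deliberately uses the JDK's XMLStreamWriter as a stand-in rather than XStream, so that it runs without extra dependencies. The class name and the representation of an item as a list of subitem strings are assumptions.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical store: all items live in memory and the file is rewritten on each add.
class InMemoryStore {
    private final List<List<String>> items = new ArrayList<>();
    private final Path file;

    InMemoryStore(Path file) {
        this.file = file;
    }

    /** Adds one item (a list of subitem texts) and rewrites the whole file. */
    void add(List<String> subitems) throws IOException, XMLStreamException {
        items.add(new ArrayList<>(subitems));
        save();
    }

    private void save() throws IOException, XMLStreamException {
        // newOutputStream truncates an existing file by default
        try (OutputStream out = Files.newOutputStream(file)) {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            w.writeStartDocument("UTF-8", "1.0");
            w.writeStartElement("datastore");
            for (List<String> item : items) {
                w.writeStartElement("item");
                for (String text : item) {
                    w.writeStartElement("subitem");
                    w.writeCharacters(text);   // escapes <, > and & as needed
                    w.writeEndElement();
                }
                w.writeEndElement();
            }
            w.writeEndElement();
            w.writeEndDocument();
            w.flush();
            w.close();
        }
    }
}
```

Each add costs time proportional to the total number of items, so this only suits datasets small enough to rewrite comfortably.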