How to write an XML database file efficiently?

Posted on 2024-10-14 07:13:06

I want to build an XML file as a datastore. It should look something like this:

<datastore>
    <item>
        <subitem></subitem>
        ...
        <subitem></subitem>
    </item>
    ....
    <item>
        <subitem></subitem>
        ...
        <subitem></subitem>
    </item>
</datastore>

At runtime I may need to add items to the datastore. The number of items may be large, so I don't want to hold the whole document in memory, which rules out DOM. I just want to write the part where a change occurs. Or does DOM support this?

I had a first look at StAX, but I am not sure if it does what I want.

Wouldn't it be best to remember a cursor position at the end of the file, just before the root element is closed? That is always the position where new items will be added. So if I remember that position and keep it up to date during changes, I could add a new item at the end without iterating through the whole file.

Maybe a second cursor could be used independently of the first one to iterate over the document for reading purposes only.

I can't see that StAX supports any of this, does it?

Isn't there a block-based API for files instead of a stream-based one? Aren't files and filesystems typical examples of block "devices"? And if there is such an API, does it help me with my problem?

Thanks in advance.


雅心素梦 2024-10-21 07:13:06

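Updating XML in place is basically impossible, because there is no "cheap" way to insert data into the middle of a file.

Appending to XML is not so bad. All you need to do is seek to the end of the file, back up past the end tag (</datastore> in this case), and start writing, finishing with a fresh end tag. All told, this is a cheap operation, but none of the frameworks really support it, because they are mostly designed to work with complete, well-formed XML documents as a whole rather than with pieces.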
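The seek-and-overwrite step described above can be sketched with java.io.RandomAccessFile. This is only an illustration under stated assumptions: the class and method names are made up for this example, the file is UTF-8, the closing </datastore> tag lies within the last 4096 bytes, and the item string passed in is already well-formed, escaped XML.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: appends an item by overwriting the closing root tag. */
final class DatastoreAppender {
    private static final byte[] END_TAG = "</datastore>".getBytes(StandardCharsets.UTF_8);
    private static final int TAIL_SIZE = 4096; // assumption: the end tag is within this tail

    /** Returns the byte offset of the last "</datastore>", or -1 if it is not in the tail. */
    static long findEndTag(RandomAccessFile file) throws IOException {
        long length = file.length();
        int tailSize = (int) Math.min(length, TAIL_SIZE);
        byte[] tail = new byte[tailSize];
        file.seek(length - tailSize);
        file.readFully(tail);
        // Search backwards so that only the tail of the file is ever examined.
        for (int start = tailSize - END_TAG.length; start >= 0; start--) {
            int i = 0;
            while (i < END_TAG.length && tail[start + i] == END_TAG[i]) {
                i++;
            }
            if (i == END_TAG.length) {
                return length - tailSize + start;
            }
        }
        return -1;
    }

    /** Overwrites the end tag with the new item, then writes the end tag again. */
    static void appendItem(String path, String itemXml) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            long pos = findEndTag(file);
            if (pos < 0) {
                throw new IOException("closing </datastore> tag not found near end of file");
            }
            file.seek(pos);
            file.write(itemXml.getBytes(StandardCharsets.UTF_8));
            file.write('\n');
            file.write(END_TAG);
            file.write('\n');
            // Drop any stale bytes left over from the old tail.
            file.setLength(file.getFilePointer());
        }
    }
}
```

A long-running process could cache the offset returned by findEndTag and advance it after each append, which is essentially the "remembered cursor position" from the question.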

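You could use something like StAX, but in that case StAX would not know about the <datastore> tag; it would only know about the <item> tag and its elements. You then create your items and write them, over and over, to the same OutputStream that you have set up.

That is the best way to go about it.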
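As a rough illustration of writing only item fragments through StAX, the sketch below uses the standard javax.xml.stream.XMLStreamWriter. The class name and the list-of-strings item model are assumptions made for this example. It also assumes the JDK's built-in StAX implementation, which does not object to several top-level elements on one writer; stricter implementations may need configuration or may reject this.

```java
import java.io.OutputStream;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical helper: writes <item> fragments, never the <datastore> root. */
final class ItemFragmentWriter {

    /** Creates a writer on a stream the caller has already positioned for appending. */
    static XMLStreamWriter open(OutputStream out) throws XMLStreamException {
        return XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
    }

    /** Writes one <item> with one <subitem> per string and flushes it to the stream. */
    static void writeItem(XMLStreamWriter writer, List<String> subitems)
            throws XMLStreamException {
        writer.writeStartElement("item");
        for (String text : subitems) {
            writer.writeStartElement("subitem");
            writer.writeCharacters(text); // the writer escapes markup characters
            writer.writeEndElement();
        }
        writer.writeEndElement();
        writer.flush(); // push the fragment out so it is not lost if the process stops
    }
}
```

The same writer can be kept open and reused for every new item. Writing the closing </datastore> tag is left to whatever code manages the file position.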

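But if you need to delete or change data, you are stuck rewriting the file, or resorting to hacks such as marking elements as "inactive": seek into the XML file, look for the active="Y" attribute, and change the Y to N in place. It can be done, and it can be efficient, but it goes far beyond what the normal XML processing frameworks let you do. If I were to do it, I would read the entire file once, track these entries, and note where they are located in the file so that I could later seek to them and change them easily and efficiently.

Then, when you update something, you "inactivate" the old entry and "append" the new one. Eventually you GC the file by rewriting all of it and tossing the old "inactive" entries.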
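The offset-tracking hack could look roughly like the following. It is a sketch, not a robust implementation: the class and method names are invented, the attribute is assumed to be spelled exactly active="Y" with double quotes and no extra whitespace, the encoding is assumed to be ASCII-compatible, and the literal text active="Y" is assumed never to occur inside element content.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: records where each active flag lives and flips it in place. */
final class ActiveFlagIndex {
    private static final byte[] MARKER =
            "active=\"Y\"".getBytes(StandardCharsets.US_ASCII);
    private static final int FLAG_OFFSET = 8; // index of 'Y' inside the marker

    /** Streams through the file once and returns the byte offset of every 'Y' flag. */
    static List<Long> indexActiveFlags(String path) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long pos = 0;    // number of bytes consumed so far
            int matched = 0; // number of marker bytes matched so far
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == MARKER[matched]) {
                    matched++;
                } else {
                    // 'a' occurs only at the start of the marker, so this restart is enough.
                    matched = (b == MARKER[0]) ? 1 : 0;
                }
                if (matched == MARKER.length) {
                    offsets.add(pos - MARKER.length + FLAG_OFFSET);
                    matched = 0;
                }
            }
        }
        return offsets;
    }

    /** Overwrites a single 'Y' with 'N'; the file length does not change. */
    static void deactivate(String path, long flagOffset) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            file.seek(flagOffset);
            if (file.read() != 'Y') {
                throw new IOException("no active flag at offset " + flagOffset);
            }
            file.seek(flagOffset);
            file.write('N');
        }
    }
}
```

Because the replacement is exactly one byte long, nothing after it has to move, which is what makes the in-place change cheap. An update would then be a deactivate call followed by an append of the new item.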


红衣飘飘貌似仙 2024-10-21 07:13:06

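As a rule of thumb, XML files are not very efficient as a datastore, especially for the record-based data you seem to want to keep in them.

But if you already have the file and there is absolutely nothing you can do about that, you can use the StAX XMLEventReader and XMLEventWriter to read through the file quickly and insert or modify elements in it.

When I say "quickly", though, I mean faster than DOM, but nowhere near as efficient as any relational database.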
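A minimal sketch of the XMLEventReader/XMLEventWriter approach: copy every event from the old file to a new one and inject an extra item just before the closing root tag. The class and method names are invented for this example, and it assumes that no nested element is also called datastore. Memory use stays flat, but the whole file is still read and rewritten on every change.

```java
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical helper: streams a datastore to a new file, adding one item. */
final class StreamingItemInserter {

    static void copyWithNewItem(InputStream in, OutputStream out, String subitemText)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer =
                XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
        XMLEventFactory events = XMLEventFactory.newInstance();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                // Just before </datastore> is copied, emit the new item.
                if (event.isEndElement()
                        && "datastore".equals(event.asEndElement().getName().getLocalPart())) {
                    writer.add(events.createStartElement("", "", "item"));
                    writer.add(events.createStartElement("", "", "subitem"));
                    writer.add(events.createCharacters(subitemText));
                    writer.add(events.createEndElement("", "", "subitem"));
                    writer.add(events.createEndElement("", "", "item"));
                }
                writer.add(event); // every original event is copied unchanged
            }
            writer.flush();
        } finally {
            writer.close();
            reader.close();
        }
    }
}
```

In practice the output would go to a temporary file that replaces the original once the copy has finished.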

Update: another option you could consider is vtd-xml. I haven't tried it in a real project, but it actually looks pretty decent.


七颜 2024-10-21 07:13:06

If you always want to append items at the end, then the best way to handle this is to have two XML files. The outer one, datastore.xml, is simply a wrapper and looks like this:

<!DOCTYPE datastore [
  <!ENTITY e SYSTEM "items.xml">
]>
<datastore>&e;</datastore>

The file items.xml looks like this:

<item>....</item>
<item>....</item>
<item>....</item>

with no wrapper element.

When you want to append data, you can open items.xml and write to the end of it. When you want to read data, open datastore.xml with an XML parser.
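The two-file scheme can be tried out with a sketch like the one below: items are appended to items.xml with a plain file append, and datastore.xml is read with a streaming SAX parser so the document is never held in memory. The class and method names are hypothetical. The sketch assumes the JDK's default SAX parser, which resolves external entities unless it has been hardened to forbid them, and it assumes items.xml already exists when the wrapper is parsed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper for the wrapper-plus-entity layout described above. */
final class EntityBackedDatastore {

    /** Appends one already well-formed <item> element to the fragment file. */
    static void appendItem(Path itemsFile, String itemXml) throws IOException {
        Files.write(itemsFile, (itemXml + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Parses the wrapper file; the parser pulls in items.xml through the entity. */
    static int countItems(Path wrapperFile)
            throws IOException, SAXException, ParserConfigurationException {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                if ("item".equals(qName)) {
                    count[0]++;
                }
            }
        };
        // Parsing from a File lets the relative "items.xml" resolve next to the wrapper.
        SAXParserFactory.newInstance().newSAXParser().parse(wrapperFile.toFile(), handler);
        return count[0];
    }
}
```

Appending never touches datastore.xml, so there is no closing tag to locate or rewrite.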

Of course, once your data grows beyond 20 MB or so, it may well be better to use an XML database. But I've been using this approach for years for records of Saxon orders, with files that are currently about 8 MB, and it works fine.
