Relating data in a large XML document

Posted 2024-12-13 09:00:12

I have an XML structure that looks like this:

<root>
    <index>
        <item>item 1</item>
        <item>item 2</item>
        <!-- many more items -->
    </index>
    <data>
        <row>
            <!-- relates to item 1 -->
            <cell>1</cell>
            <cell>2</cell>
            <!-- many more cells -->
        </row>
        <row>
            <!-- relates to item 2 -->
            <cell>3</cell>
            <cell>4</cell>
            <!-- many more cells -->
        </row>
        <!-- as many rows as there are items in the index -->    
    </data>
</root>

I'm trying to create a parser that outputs (to a database) a structure like this:

item 1 : [1, 2, ...]
item 2 : [3, 4, ...]
...

Normally, I'd use a SAX parser, construct a HashMap, fill in the keys when the parser passes the index element, and add the cell data afterwards.

However, the document may contain a lot of data, so I'm afraid I will run into memory issues.

My question is: how do I parse the file with as little memory usage as possible?

One thing I thought about was to construct two SAX parsers, one that runs over the index and another that parses the data. The problem is I have no idea how I can suspend one parser, start the other, suspend the other, restart the first one and so on.

Is this possible or are there better ways to deal with this?

BTW: sadly, I have absolutely no control over the format of the XML.

Comments (1)

葬シ愛 2024-12-20 09:00:12

The SAX parser isn't going to need to keep much in memory other than the hash map. I would SAX-parse the index element to generate a List<Item>, and then, for each row, remove the matching item from the list (asserting that it is there) and add it to a Map<Item,List<Cell>>.
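As a rough illustration of that idea, here is a minimal sketch of a single-pass SAX handler. It assumes rows relate to index items purely by position, as the comments in the question's XML suggest. The class name IndexedRowHandler and its parse helper are hypothetical, and plain strings stand in for the Item and Cell types.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects <item> names first, then pairs each <row>
// with the next unclaimed item (assumption: rows follow index order).
class IndexedRowHandler extends DefaultHandler {
    private final Deque<String> pendingItems = new ArrayDeque<>();
    final Map<String, List<String>> result = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private List<String> currentCells;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
        if ("row".equals(qName)) {
            currentCells = new ArrayList<>();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (qName) {
            case "item":
                pendingItems.addLast(text.toString().trim());
                break;
            case "cell":
                currentCells.add(text.toString().trim());
                break;
            case "row":
                String item = pendingItems.pollFirst();
                if (item == null) {
                    throw new IllegalStateException("more rows than index items");
                }
                result.put(item, currentCells);
                currentCells = null;
                break;
            default:
                break;
        }
    }

    // Convenience wrapper for the sketch; checked exceptions are rethrown unchecked.
    static Map<String, List<String>> parse(String xml) {
        try {
            IndexedRowHandler handler = new IndexedRowHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.result;
        } catch (Exception e) {
            throw new RuntimeException("parse failed", e);
        }
    }
}
```

Note that the parser will reject the sample document unless the index element is properly closed with </index>.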

The memory you will need is proportional to the total number of items plus an entry for each cell. I don't think you need to maintain much more context than that when parsing with SAX.
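If even one entry per cell is too much, a possible variation (not part of the approach described above) is to hand each finished row to a callback instead of storing it in a map. Then only the index and the current row stay in memory. This is a sketch under the same positional assumption. StreamingRowHandler and the sink parameter are hypothetical names; in practice the sink might write the row to the database.

```java
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.BiConsumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps the index in memory, but flushes every row
// to the sink as soon as its closing tag is seen.
class StreamingRowHandler extends DefaultHandler {
    private final Deque<String> pendingItems = new ArrayDeque<>();
    private final BiConsumer<String, List<String>> sink;
    private final StringBuilder text = new StringBuilder();
    private List<String> currentCells;

    StreamingRowHandler(BiConsumer<String, List<String>> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);
        if ("row".equals(qName)) {
            currentCells = new ArrayList<>();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("item".equals(qName)) {
            pendingItems.addLast(text.toString().trim());
        } else if ("cell".equals(qName)) {
            currentCells.add(text.toString().trim());
        } else if ("row".equals(qName)) {
            // removeFirst() throws if there are more rows than index items.
            sink.accept(pendingItems.removeFirst(), currentCells);
            currentCells = null; // let the row be garbage collected
        }
    }

    // Convenience wrapper for the sketch; checked exceptions are rethrown unchecked.
    static void parse(InputStream in, BiConsumer<String, List<String>> sink) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(in, new StreamingRowHandler(sink));
        } catch (Exception e) {
            throw new RuntimeException("parse failed", e);
        }
    }
}
```

This works in one pass only because the index precedes the data in the document. The item names still have to be held until their rows arrive, but the cells do not accumulate.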
