Correlating data in a large XML document

Posted 2024-12-13 09:00:12

I have an XML structure that looks like this:

<root>
    <index>
        <item>item 1</item>
        <item>item 2</item>
        <!-- many more items -->
    </index>
    <data>
        <row>
            <!-- relates to item 1 -->
            <cell>1</cell>
            <cell>2</cell>
            <!-- many more cells -->
        </row>
        <row>
            <!-- relates to item 2 -->
            <cell>3</cell>
            <cell>4</cell>
            <!-- many more cells -->
        </row>
        <!-- as many rows as there are items in the index -->    
    </data>
</root>

I'm trying to create a parser that outputs (to a database) a structure like this:

item 1 : [1, 2, ...]
item 2 : [3, 4, ...]
...

Normally, I'd use a SAX parser and build a HashMap: fill in the keys as the parser passes through the index element, then add the cell data to each key as the rows come by.
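For reference, the in-memory approach described above might look roughly like the following. This is only a sketch under a few assumptions: the class name `InMemoryIndexParser` is made up for illustration, rows are matched to items purely by position, item names are unique, and a `LinkedHashMap` plus a key list stands in for the plain HashMap so the item order is preserved. It holds every cell in memory until parsing finishes, which is exactly the concern.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; illustrates the "SAX + map" approach only.
class InMemoryIndexParser {

    /** Collects every item and every cell in memory: simple, but memory-hungry. */
    static class Handler extends DefaultHandler {
        final Map<String, List<String>> result = new LinkedHashMap<>();
        private final List<String> keys = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private int rowIndex = -1;
        private List<String> currentRow;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for the new element
            if (qName.equals("row")) {
                // Assumption: the n-th <row> belongs to the n-th <item>.
                rowIndex++;
                currentRow = result.get(keys.get(rowIndex));
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (qName.equals("item")) {
                String key = text.toString().trim();
                keys.add(key);
                result.put(key, new ArrayList<>());
            } else if (qName.equals("cell")) {
                currentRow.add(text.toString().trim());
            }
        }
    }

    static Map<String, List<String>> parse(InputSource source) throws Exception {
        Handler handler = new Handler();
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return handler.result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><index><item>item 1</item><item>item 2</item></index>"
                + "<data><row><cell>1</cell><cell>2</cell></row>"
                + "<row><cell>3</cell><cell>4</cell></row></data></root>";
        System.out.println(parse(new InputSource(new StringReader(xml))));
    }
}
```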

However, the document may contain a lot of data so I'm afraid I will run into memory issues.

My question is: how do I parse the file with as little memory usage as possible?

One thing I thought about was to construct two SAX parsers, one that runs over the index and another that parses the data. The problem is I have no idea how I can suspend one parser, start the other, suspend the other, restart the first one and so on.

Is this possible or are there better ways to deal with this?
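To make the two-parser idea concrete, here is one way it could be sketched. Note that it swaps SAX for StAX (`javax.xml.stream`), a pull-style API where the caller asks for the next event, so "suspending" a parser simply means not calling it for a while. The sketch assumes the input can be opened twice (for example, a file on disk), that rows match items by position, and that a row's cells fit in memory. The names `LockstepXmlReader`, `process`, and the `sink` callback (a stand-in for the database write) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Supplier;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name; two pull parsers advanced in lockstep over the same document.
class LockstepXmlReader {

    /** Returns the text of the next <item>, or null once </index> or the end is reached. */
    static String nextItem(XMLStreamReader r) throws XMLStreamException {
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT && r.getLocalName().equals("item")) {
                return r.getElementText().trim();
            }
            if (event == XMLStreamConstants.END_ELEMENT && r.getLocalName().equals("index")) {
                return null;
            }
        }
        return null;
    }

    /** Returns the cells of the next <row>, or null if there are no more rows. */
    static List<String> nextRow(XMLStreamReader r) throws XMLStreamException {
        List<String> cells = null;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if (r.getLocalName().equals("row")) {
                    cells = new ArrayList<>();
                } else if (cells != null && r.getLocalName().equals("cell")) {
                    cells.add(r.getElementText().trim());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && cells != null && r.getLocalName().equals("row")) {
                return cells;
            }
        }
        return null;
    }

    /** Pairs each item with its row and hands the pair to sink; only one row is held at a time. */
    static void process(Supplier<InputStream> open, BiConsumer<String, List<String>> sink)
            throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (InputStream first = open.get(); InputStream second = open.get()) {
            XMLStreamReader items = factory.createXMLStreamReader(first);
            XMLStreamReader rows = factory.createXMLStreamReader(second);
            String key;
            while ((key = nextItem(items)) != null) {
                List<String> cells = nextRow(rows); // the items reader just waits meanwhile
                if (cells == null) {
                    break; // fewer rows than items
                }
                sink.accept(key, cells);
            }
            items.close();
            rows.close();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = ("<root><index><item>item 1</item><item>item 2</item></index>"
                + "<data><row><cell>1</cell><cell>2</cell></row>"
                + "<row><cell>3</cell><cell>4</cell></row></data></root>")
                .getBytes(StandardCharsets.UTF_8);
        process(() -> new ByteArrayInputStream(xml),
                (item, cells) -> System.out.println(item + " : " + cells));
    }
}
```

With a real file, the supplier would open a fresh stream over the same path each time it is called. The cost of this design is that the document is read twice, in exchange for never holding more than one item and one row in memory.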

BTW: sadly, I have absolutely no control over the format of the XML.
