Correlating data in a large XML document
I have an XML structure that looks like this:
<root>
<index>
<item>item 1</item>
<item>item 2</item>
<!-- many more items -->
</index>
<data>
<row>
<!-- relates to item 1 -->
<cell>1</cell>
<cell>2</cell>
<!-- many more cells -->
</row>
<row>
<!-- relates to item 2 -->
<cell>3</cell>
<cell>4</cell>
<!-- many more cells -->
</row>
<!-- as many rows as there are items in the index -->
</data>
</root>
I'm trying to create a parser that outputs (to a database) a structure like this:
item 1 : [1, 2, ...]
item 2 : [3, 4, ...]
...
Normally, I'd use a SAX parser, construct a HashMap, fill in the keys when the parser passes the index element, and add the cell data afterwards.
However, the document may contain a lot of data so I'm afraid I will run into memory issues.
My question is: how do I parse the file with as little memory usage as possible?
One thing I thought about was to construct two SAX parsers, one that runs over the index and another that parses the data. The problem is I have no idea how I can suspend one parser, start the other, suspend the other, restart the first one and so on.
Is this possible or are there better ways to deal with this?
BTW: sadly, I have absolutely no control over the format of the XML.
Comments (1)
The SAX parser isn't going to need to keep much in memory other than the hash map. I would SAX-parse the index element to generate a List<Item>; then, for each row element, I can remove the corresponding item from the list (asserting that it is in there) and add it to a Map<Item, List<Cell>>. The memory you are going to need is proportional to the total number of items plus an entry for each cell. I don't think you need to maintain much more context than that when parsing with SAX.
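A minimal sketch of this idea in Java, assuming the rows appear in the same order as the index items (as the comments in the sample XML suggest). The class name RowCorrelator and the rowSink callback are hypothetical names chosen for illustration. One deliberate variation: instead of always building the full Map, each finished row is handed to a callback, so a caller that writes straight to the database only holds the index and the current row in memory. Passing `map::put` as the callback gives the Map-based version described above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.BiConsumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: pairs the n-th <row> with the n-th <item> of the index.
final class RowCorrelator extends DefaultHandler {
    // Index items that have not been matched to a row yet, in document order.
    private final Deque<String> pendingItems = new ArrayDeque<>();
    // Receives (item, cells) once per row, e.g. a database insert or map::put.
    private final BiConsumer<String, List<String>> rowSink;
    private final StringBuilder text = new StringBuilder();
    private List<String> currentCells;

    RowCorrelator(BiConsumer<String, List<String>> rowSink) {
        this.rowSink = rowSink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // discard whitespace collected between elements
        if ("row".equals(qName)) {
            currentCells = new ArrayList<>();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        switch (qName) {
            case "item":
                pendingItems.addLast(text.toString().trim());
                break;
            case "cell":
                if (currentCells != null) {
                    currentCells.add(text.toString().trim());
                }
                break;
            case "row":
                // Remove the matching item from the index, failing if none is left.
                String item = pendingItems.pollFirst();
                if (item == null) {
                    throw new SAXException("More rows than index items");
                }
                rowSink.accept(item, currentCells);
                currentCells = null; // the row can now be garbage collected
                break;
            default:
                break;
        }
    }

    static void parse(InputStream in, BiConsumer<String, List<String>> rowSink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new RowCorrelator(rowSink));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><index><item>item 1</item><item>item 2</item></index>"
                + "<data><row><cell>1</cell><cell>2</cell></row>"
                + "<row><cell>3</cell><cell>4</cell></row></data></root>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        // Prints "item 1 : [1, 2]" and then "item 2 : [3, 4]".
        parse(in, (item, cells) -> System.out.println(item + " : " + cells));
    }
}
```

In a real import the println callback would be replaced by whatever performs the database write. The sketch still keeps the whole index in memory, which seems unavoidable for a single pass because the index precedes the data.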