XML tool design question

Posted on 2024-12-05 16:11:35


I was asked this question at an interview. Of course there are many approaches to a solution, but I just wanted to know if there is some really good approach that stands out. There is a huge XML file of 2 GB stored on the hard disk of a low-end PC with 512 MB of RAM.
The XML file stores timestamps and corresponding string values. I have to design a tool that parses the XML file to retrieve specific information, such as the string for a particular timestamp. The interviewer was not concerned with the searching technique inside the tool; he wanted a high-level approach to the design of the tool, considering only the 512 MB of RAM and the 2 GB file size. Are there any interesting design approaches to this?


Comments (3)

献世佛 2024-12-12 16:11:36


There are two approaches to XML parsing: 1) using a DOM parser, 2) using a SAX parser. Trying to parse a 2 GB file with 512 MB of RAM using a DOM parser is practically guaranteed to result in an out-of-memory error, so go with a SAX parser, which will also be faster since you already know what you are looking for.
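The SAX approach described above can be sketched as follows. Note this is a minimal sketch under assumptions: the question gives no schema, so the element and attribute names (`entry`, `timestamp`) are hypothetical, and the lookup aborts the parse as soon as the match is found so the whole 2 GB file is rarely read:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical record shape: <entries><entry timestamp="...">value</entry>...</entries>
public class TimestampLookup extends DefaultHandler {

    private final String target;
    private final StringBuilder value = new StringBuilder();
    private boolean inMatch;
    private String result;

    TimestampLookup(String target) {
        this.target = target;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("entry".equals(qName) && target.equals(attrs.getValue("timestamp"))) {
            inMatch = true; // start collecting this entry's text content
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inMatch) {
            value.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (inMatch && "entry".equals(qName)) {
            result = value.toString();
            throw new SAXException("found"); // abort the parse once the match is complete
        }
    }

    // Streams the document and returns the value for the given timestamp, or null.
    public static String lookup(InputSource source, String timestamp) throws Exception {
        TimestampLookup handler = new TimestampLookup(timestamp);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (SAXException expected) {
            // thrown deliberately by endElement to stop reading the rest of the file
        }
        return handler.result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<entries>"
                + "<entry timestamp=\"2011-10-01T12:00:00\">hello</entry>"
                + "<entry timestamp=\"2011-10-01T12:00:01\">world</entry>"
                + "</entries>";
        // prints "world"
        System.out.println(lookup(new InputSource(new StringReader(xml)), "2011-10-01T12:00:01"));
    }
}
```

Memory use stays bounded by the size of a single entry, not the size of the file, which is the whole point of the streaming approach under a 512 MB limit.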

℡寂寞咖啡 2024-12-12 16:11:35


Instead of SAX, I would use the StAX APIs in Java SE 6 for this use case. The code below is from an answer of mine to a similar question. StAX is used to split a large XML file into several smaller files:

import java.io.*;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
        xsr.nextTag(); // Advance to statements element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            File file = new File("out/" + xsr.getAttributeValue(null, "account") + ".xml");
            t.transform(new StAXSource(xsr), new StreamResult(file));
        }
    }

}

A similar answer by skaffman describes how StAX can be used to process an XML document in chunks; in his answer, JAXB is used to process the chunks.
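Since skaffman's answer is not reproduced here, a minimal pure-StAX sketch of the same chunk-at-a-time idea (without JAXB, and again assuming a hypothetical `<entry timestamp="...">value</entry>` layout, which the question does not specify):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxLookup {

    // Streams through <entry timestamp="...">value</entry> records and returns
    // the value for the requested timestamp, or null if it is absent.
    public static String lookup(String xml, String timestamp) throws Exception {
        XMLStreamReader xsr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT
                        && "entry".equals(xsr.getLocalName())
                        && timestamp.equals(xsr.getAttributeValue(null, "timestamp"))) {
                    // Only this one element's text is ever materialized in memory
                    return xsr.getElementText();
                }
            }
            return null;
        } finally {
            xsr.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<entries>"
                + "<entry timestamp=\"2011-10-01T12:00:00\">hello</entry>"
                + "<entry timestamp=\"2011-10-01T12:00:01\">world</entry>"
                + "</entries>";
        // prints "world"
        System.out.println(lookup(xml, "2011-10-01T12:00:01"));
    }
}
```

Compared with SAX's push-style callbacks, the StAX cursor lets the caller pull events in a plain loop and return as soon as the match is found, with no exception trick needed to stop early.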

遇到 2024-12-12 16:11:35


Maybe the parsing should be done with SAX instead of DOM, since with a DOM parser you hold the complete document in memory before you can access the data. If I understand you correctly, you already know from the start which timestamps you are interested in, so you could use a SAX parser to retrieve the corresponding string values; that should be faster and should not consume nearly as much memory.
