Load a local chunk into DOM while parsing a large XML file with SAX (Java)

Posted on 2024-12-13 10:05:46

I have an XML file that I would like to avoid loading entirely into memory.
As everyone knows, for such a file I'm better off using a SAX parser (which walks through the file and fires events when something relevant is found).

My current problem is that I would like to process the file "by chunk", which means:

1. Parse the file and find a relevant tag (node).
2. Load this tag entirely into memory (as we would in DOM).
3. Process this entity (the local chunk).
4. When I'm done with the chunk, release it and continue from 1 (until end of file).

In a perfect world, I'm looking for something like this:

// 1. Create a parser and set the file to load
IdealParser p = new IdealParser("BigFile.xml");
// 2. Set an XPath to define the interesting nodes
p.setRelevantNodesPath("/path/to/relevant/nodes");
// 3. Add a handler to call back the right method once a node is found
p.setHandler(new Handler() {
    // 4. The method called back by the parser when a relevant node is found
    void aNodeIsFound(saxNode aNode) {
        // 5. Inflate the current node, i.e. load it (and all its content) in memory
        DomNode d = aNode.expand();
        // 6. Do something with the inflated node (method to be defined somewhere)
        doThingWithNode(d);
    }
});
// 7. Start the parser
p.start();

I'm currently stuck on how to expand a "SAX node" (if you see what I mean…) efficiently.

Is there any Java framework or library relevant to this kind of task?

万劫不复 2024-12-20 10:05:46

UPDATE

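You could also just use the javax.xml.xpath APIs (though, as far as I know, evaluating against an InputSource builds the whole document in memory first, so this only suits files that fit in memory):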

package forum7998733;

import java.io.FileReader;
import javax.xml.xpath.*;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathDemo {

    public static void main(String[] args) throws Exception {
        XPathFactory xpf = XPathFactory.newInstance();
        XPath xpath = xpf.newXPath();
        InputSource xml = new InputSource(new FileReader("BigFile.xml"));
        Node result = (Node) xpath.evaluate("/path/to/relevant/nodes", xml, XPathConstants.NODE);
        System.out.println(result);
    }

}

Below is a sample of how it could be done with StAX.

input.xml

Below is some sample XML:

<statements>
   <statement account="123">
      ...stuff...
   </statement>
   <statement account="456">
      ...stuff...
   </statement>
</statements>

Demo

In this example a StAX XMLStreamReader is used to find the nodes that will be converted to a DOM. Here we convert each statement fragment to a DOM, but your navigation algorithm could be more advanced.

package forum7998733;

import java.io.FileReader;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.*;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("src/forum7998733/input.xml"));
        xsr.nextTag(); // Advance to statements element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            DOMResult domResult = new DOMResult();
            t.transform(new StAXSource(xsr), domResult);

            DOMSource domSource = new DOMSource(domResult.getNode());
            StreamResult streamResult = new StreamResult(System.out);
            t.transform(domSource, streamResult);
        }
    }

}

Output

<?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="123">
      ...stuff...
   </statement><?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="456">
      ...stuff...
   </statement>
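
To do something with each fragment rather than print it, the node held by the DOMResult can be used directly. The following is a minimal sketch, not part of the original answer: the class name StatementChunks and the method accounts are made up for illustration. The loop checks the reader's current event instead of calling nextTag() unconditionally, on the assumption that the transformer may leave the reader positioned just past the fragment it consumed (which could otherwise skip an element when there is no whitespace between siblings).

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: collects the account attribute of every statement chunk.
final class StatementChunks {

    static List<String> accounts(Reader xml) throws Exception {
        XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        List<String> accounts = new ArrayList<>();
        while (xsr.hasNext()) {
            if (xsr.getEventType() == XMLStreamConstants.START_ELEMENT
                    && "statement".equals(xsr.getLocalName())) {
                // Build a DOM for this fragment only; the transformer advances the reader.
                DOMResult chunk = new DOMResult();
                t.transform(new StAXSource(xsr), chunk);
                Element statement = ((Document) chunk.getNode()).getDocumentElement();
                accounts.add(statement.getAttribute("account"));
                // The chunk is no longer referenced after this point and can be collected.
            } else {
                xsr.next();
            }
        }
        return accounts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<statements>"
                + "<statement account=\"123\">...stuff...</statement>"
                + "<statement account=\"456\">...stuff...</statement>"
                + "</statements>";
        System.out.println(accounts(new StringReader(xml)));
    }
}
```

Only one statement element is held as a DOM at any time, which is the "local chunk" behaviour the question asks for.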

翻了热茶 2024-12-20 10:05:46

It could be done with SAX... But I think the newer StAX (Streaming API for XML) will serve your purpose better. You could create an XMLEventReader and use that to parse your file, detecting which nodes adhere to one of your criteria. For simple path-based selection (not really XPath, but some simple /-delimited path) you'd need to maintain a path to your current node by adding an entry to a String on each start element and cutting off the last entry on each end tag. A boolean flag can suffice to track whether you're currently in "relevant mode" or not.
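
The path bookkeeping described above can be sketched roughly as follows. This is only an illustration under my own assumptions: the class PathTracker and the method elementPaths are hypothetical names, it uses an XMLStreamReader rather than an XMLEventReader for brevity, and it keeps the open element names in a Deque instead of a String.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reports the path of every element seen while streaming.
final class PathTracker {

    static List<String> elementPaths(String xml) throws Exception {
        XMLStreamReader xsr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Deque<String> open = new ArrayDeque<>();   // names of the currently open elements
        List<String> paths = new ArrayList<>();
        while (xsr.hasNext()) {
            int event = xsr.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                open.addLast(xsr.getLocalName());            // entering: extend the path
                paths.add("/" + String.join("/", open));
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                open.removeLast();                            // leaving: cut the last entry
            }
        }
        return paths;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><one><two/></one><three><embedded/></three></root>";
        for (String path : elementPaths(xml)) {
            System.out.println(path);
        }
    }
}
```

Deciding whether a node is relevant then becomes a lookup of the current path in a set of wanted paths at each start element.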

As you obtain XMLEvents from your reader, you could copy the relevant ones over to an XMLEventWriter that you've created on some suitable placeholder, like a StringWriter or ByteArrayOutputStream. Once you've completed the copying for some XML extract that forms a "subdocument" of what you wish to build a DOM for, simply supply your placeholder to a DocumentBuilder in a suitable form.

The limitation here is that you're not harnessing all the power of the XPath language. If you wish to take stuff like node position into account, you'd have to foresee that in your own path. Perhaps someone knows of a good way of integrating a true XPath implementation into this.
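
One partial workaround, offered as my own assumption rather than something from this answer: keep the streaming selection simple and apply real XPath inside each chunk once it is a DOM, since javax.xml.xpath can evaluate an expression against a Node. The class name ChunkXPath and the sample expressions are illustrative, and the chunk is parsed from a string here only as a stand-in for whatever the streaming step produces.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: runs an XPath expression against one small chunk.
final class ChunkXPath {

    // Counts the nodes that the expression selects inside the given chunk.
    static int count(String chunkXml, String expression) throws Exception {
        Document chunk = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(chunkXml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(expression, chunk, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        String chunkXml = "<one><two>this is text</two><two>more</two></one>";
        System.out.println(count(chunkXml, "/one/two"));      // all two elements
        System.out.println(count(chunkXml, "/one/two[2]"));   // positional predicate
    }
}
```

This does not give positional matching across the whole file, only within a chunk that has already been selected.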

StAX is really nice in that it gives you control over the parsing, rather than using some callback interface through a handler like SAX.

There's yet another alternative: using XSLT. An XSLT stylesheet is the ideal way to filter out only relevant stuff. You could transform your input once to obtain the required fragments and process those. Or run multiple stylesheets over the same input to get the desired extract each time. An even nicer (and more efficient) solution, however, would be the use of extension functions and/or extension elements.
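
As a rough sketch of the plain filtering variant (no extension functions), a small stylesheet can copy only the wanted subtrees. Everything named here is an assumption for illustration: the class XsltFilter, the wrapper element extract, and the two selected paths, which mirror the sample used later in this answer. Whether the processor avoids building the full input tree depends on the implementation; this sketch does not address that.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: filters a document down to the relevant subtrees with XSLT 1.0.
final class XsltFilter {

    // Keeps only /root/one and /root/three/embedded, wrapped in an <extract> element.
    private static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:template match=\"/\">"
          + "<extract><xsl:copy-of select=\"/root/one | /root/three/embedded\"/></extract>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    static String filter(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><one><two>this is text</two></one>"
                + "<two><willbeignored/></two>"
                + "<three><embedded><and/></embedded></three></root>";
        System.out.println(filter(xml));
    }
}
```

Replacing the StreamResult with a DOMResult would hand the extract over as a DOM instead of text.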

Extension functions can be implemented in a way that's independent from the XSLT processor being used. They're fairly straightforward to use in Java and I know for a fact that you can use them to pass complete XML extracts to a method, because I've done so already. Might take some experimentation, but it's a powerful mechanism. A DOM extract (or node) is probably one of the accepted parameter types for such a method. That'd leave the document building up to the XSLT processor which is even easier.

Extension elements are also very useful, but I think they need to be used in an implementation-specific manner. If you're okay with tying yourself to a specific JAXP setup like Xerces + Xalan, they might be the answer.

When going for XSLT, you'll have all the advantages of a full XPath 1.0 implementation, plus the peace of mind that comes from knowing XSLT is in really good shape in Java. It limits the building of the input tree to those nodes that are needed at any time and is blazing fast because the processors tend to compile stylesheets into Java bytecode rather than interpreting them. It is possible that using compilation instead of interpretation loses the possibility of using extension elements, though. Not certain about that. Extension functions are still possible.

Whatever way you choose, there's so much out there for XML processing in Java that you'll find plenty of help in implementing this, should you have no luck in finding a ready-made solution. That'd be the most obvious thing, of course... No need to reinvent the wheel when someone did the hard work.

Good luck!

EDIT: because I'm actually not feeling depressed for once, here's a demo using the StAX solution I whipped up. It's certainly not the cleanest code, but it'll give you the basic idea:

package staxdom;

import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.Stack;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DOMExtractor {

    private final Set<String> paths;
    private final XMLInputFactory inputFactory;
    private final XMLOutputFactory outputFactory;
    private final DocumentBuilderFactory docBuilderFactory;
    private final Stack<QName> activeStack = new Stack<QName>();

    private boolean active = false;
    private String currentPath = "";

    public DOMExtractor(final Set<String> paths) {

        this.paths = Collections.unmodifiableSet(new HashSet<String>(paths));
        inputFactory = XMLInputFactory.newFactory();
        outputFactory = XMLOutputFactory.newFactory();
        docBuilderFactory = DocumentBuilderFactory.newInstance();

    }

    public void parse(final InputStream input) throws XMLStreamException, ParserConfigurationException, SAXException, IOException {

        final XMLEventReader reader = inputFactory.createXMLEventReader(input);
        XMLEventWriter writer = null;
        StringWriter buffer = null;
        final DocumentBuilder builder = docBuilderFactory.newDocumentBuilder();

        XMLEvent currentEvent = reader.nextEvent();

        do {

            if(active)
                writer.add(currentEvent);

            if(currentEvent.isEndElement()) {

                if(active) {

                    activeStack.pop();

                    if(activeStack.isEmpty()) {
                        writer.flush();
                        writer.close();
                        final Document doc;
                        final StringReader docReader = new StringReader(buffer.toString());
                        try {
                            doc = builder.parse(new InputSource(docReader));
                        } finally {
                            docReader.close();
                        }
                        //TODO: use doc
                        //Next bit is only for demo...
                        outputDoc(doc);
                        active = false;
                        writer = null;
                        buffer = null;
                    }

                }

                int index;
                if((index = currentPath.lastIndexOf('/')) >= 0)
                    currentPath = currentPath.substring(0, index);

            } else if(currentEvent.isStartElement()) {

                final StartElement start = (StartElement)currentEvent;
                final QName qName = start.getName();
                final String local = qName.getLocalPart();

                currentPath += "/" + local;

                if(!active && paths.contains(currentPath)) {

                    active = true;

                    buffer = new StringWriter();
                    writer = outputFactory.createXMLEventWriter(buffer);

                    writer.add(currentEvent);

                }

                if(active)
                    activeStack.push(qName);

            }

            currentEvent = reader.nextEvent();

        } while(!currentEvent.isEndDocument());

    }

    private void outputDoc(final Document doc) {


        try {
            final Transformer t = TransformerFactory.newInstance().newTransformer();
            t.transform(new DOMSource(doc), new StreamResult(System.out));
            System.out.println("");
            System.out.println("");
        } catch(TransformerException ex) {
            ex.printStackTrace();
        }

    }

    public static void main(String[] args) {

        final Set<String> paths = new HashSet<String>();
        paths.add("/root/one");
        paths.add("/root/three/embedded");

        final DOMExtractor me = new DOMExtractor(paths);

        InputStream stream = null;
        try {
            stream = DOMExtractor.class.getResourceAsStream("sample.xml");
            me.parse(stream);
        } catch(final Exception e) {
            e.printStackTrace();
        } finally {
            if(stream != null)
                try {
                    stream.close();
                } catch(IOException ex) {
                    ex.printStackTrace();
                }
        }

    }

}

And the sample.xml file (should be in the same package):

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <one>
        <two>this is text</two>
        look, I can even handle mixed!
    </one>
    ... not sure what to do with this, though
    <two>
        <willbeignored/>
    </two>
    <three>
        <embedded>
            <and><here><we><go>
                Creative Commons Legal Code

                Attribution 3.0 Unported

                    CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
                    LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN
                    ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
                    INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
                    REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR
                    DAMAGES RESULTING FROM ITS USE.

                License

                THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE
                COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY
                COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS
                AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.

                BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE
                TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY
                BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS
                CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND
                CONDITIONS.
            </go></we></here></and>
        </embedded>
    </three>
</root>

EDIT 2: I just noticed in Blaise Doughan's answer that there's a StAXSource. That will be even more efficient; use it if you're going with StAX, since it eliminates the need to keep a buffer. StAX allows you to "peek" at the next event, so you can check whether it's a start element with the right path without consuming it before passing it into the transformer.

゛清羽墨安 2024-12-20 10:05:46

OK, thanks to your pieces of code, I finally ended up with my solution:

Usage is quite intuitive:

try
{
    /* CREATE THE PARSER */
    XMLParser parser = new XMLParser();
    /* CREATE THE FILTER (THIS IS A REGEX (X)PATH FILTER) */
    XMLRegexFilter filter = new XMLRegexFilter("statements/statement");
    /* CREATE THE HANDLER WHICH WILL BE CALLED WHEN A NODE IS FOUND */
    XMLHandler handler = new XMLHandler()
    {
        public void nodeFound(StringBuilder node, XMLStackFilter withFilter)
        {
            // DO SOMETHING WITH THE FOUND XML NODE
            System.out.println("Node found");
            System.out.println(node.toString());
        }
    };
    /* ATTACH THE FILTER WITH THE HANDLER */
    parser.addFilterWithHandler(filter, handler);
    /* SET THE FILE TO PARSE */
    parser.setFilePath("/path/to/bigfile.xml");
    /* RUN THE PARSER */
    parser.parse();
}
catch (Exception ex)
{
    ex.printStackTrace();
}

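Notes:

I made an XMLNodeFoundNotifier and an XMLStackFilter interface so you can easily integrate or build your own handler/filter.
Normally you should be able to parse very large files with this class. Only the returned nodes are actually loaded into memory.
You can enable attribute support by uncommenting the right part of the code; I disabled it for simplicity.
You can use as many filters per handler as you need, and vice versa.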
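
The handler receives each node as text. To get the DOM the question asked for, that text can be parsed with a DocumentBuilder, roughly as sketched below. ChunkToDom and toDom are hypothetical names, not part of the solution's code, and the sketch assumes the chunk is well-formed on its own (for example, that it does not rely on namespace prefixes declared on an ancestor outside the chunk).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: turns the text of one chunk into a standalone DOM.
final class ChunkToDom {

    // Parses the chunk text (as delivered to nodeFound) into a Document.
    static Document toDom(CharSequence chunk) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(chunk.toString())));
    }

    public static void main(String[] args) throws Exception {
        StringBuilder node = new StringBuilder("<statement account=\"123\">...stuff...</statement>");
        Document doc = toDom(node);
        System.out.println(doc.getDocumentElement().getAttribute("account"));
    }
}
```

Calling toDom(node) inside nodeFound keeps the DOM local to the callback, so it can be released as soon as the handler returns.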

All the code is here:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Stack;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.*;

/* IMPLEMENT THIS IN YOUR CLASS IN ORDER TO BE NOTIFIED WHEN A NODE IS FOUND */
interface XMLNodeFoundNotifier {

    abstract void nodeFound(StringBuilder node, XMLStackFilter withFilter);
}

/* A SMALL HANDLER, USEFUL FOR EXPLICIT CLASS DECLARATION */
abstract class XMLHandler implements XMLNodeFoundNotifier {
}

/* INTERFACE TO WRITE YOUR OWN FILTER BASED ON THE CURRENT NODES STACK (PATH)*/
interface XMLStackFilter {

    abstract boolean isRelevant(Stack fullPath);
}

/* A VERY USEFUL FILTER USING REGEX AS THE PATH FILTER */
class XMLRegexFilter implements XMLStackFilter {

    Pattern relevantExpression;

    XMLRegexFilter(String filterRules) {
        relevantExpression = Pattern.compile(filterRules);
    }

    /* HERE WE ARE ASKED TO TELL WHETHER THE CURRENT STACK (LIST OF NODES) IS RELEVANT
     * OR NOT ACCORDING TO WHAT WE WANT. RETURN TRUE IF THIS IS THE CASE */
    @Override
    public boolean isRelevant(Stack fullPath) {
        /* A POSSIBLE CLEVER WAY COULD BE TO SERIALIZE THE WHOLE PATH (INCLUDING
         * ATTRIBUTES) TO A STRING AND TO MATCH IT WITH A REGEX BEING THE FILTER
         * FOR NOW StackToString DOES NOT SERIALIZE ATTRIBUTES */
        String stackPath = XMLParser.StackToString(fullPath);
        Matcher m = relevantExpression.matcher(stackPath);
        return  m.matches();
    }
}

/* THE MAIN PARSER'S CLASS */
public class XMLParser {

    HashMap<XMLStackFilter, XMLNodeFoundNotifier> filterHandler;
    HashMap<Integer, Integer> feedingStreams;
    Stack<HashMap> currentStack;
    String filePath;

    XMLParser() {
        currentStack   = new Stack<>();
        filterHandler  = new HashMap<>();
        feedingStreams = new HashMap<>();
    }

    public void addFilterWithHandler(XMLStackFilter f, XMLNodeFoundNotifier h) {
        filterHandler.put(f, h);
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

    /* CONVERT A STACK OF NODES TO A REGULAR PATH STRING. NOTE THAT BY DEFAULT
     * THE ATTRIBUTES ARE NOT ADDED TO THE PATH. UNCOMMENT THE BLOCK BELOW TO
     * DO SO
     */
    public static String StackToString(Stack<HashMap> s) {
        int k = s.size();
        if (k == 0) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        out.append(s.get(0).get("tag"));
        for (int x = 1; x < k; ++x) {
            HashMap node = s.get(x);
            out.append('/').append(node.get("tag"));
            /* 
            // UNCOMMENT THIS TO ADD THE ATTRIBUTES SUPPORT TO THE PATH

            ArrayList <String[]>attributes = (ArrayList)node.get("attr");
            if (attributes.size()>0)
            {
            out.append("[");
            for (int i = 0 ; i<attributes.size(); i++)
            {
            String[]keyValuePair = attributes.get(i);
            if (i>0) out.append(",");
            out.append(keyValuePair[0]);
            out.append("=\"");
            out.append(keyValuePair[1]);
            out.append("\"");
            }
            out.append("]");
            }*/
        }
        return out.toString();
    }

    /*
     * ONCE A NODE HAS BEEN SUCCESSFULLY FOUND, WE GET THE DELIMITERS OF THE FILE
     * WE THEN RETRIEVE THE DATA FROM IT.
     */
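    private StringBuilder getChunk(int from, int to) throws Exception {
        int length = to - from;
        char[] readb = new char[length];
        int filled = 0;
        // try-with-resources closes the file; skip() and read() may process
        // fewer characters than requested, so both are called in a loop
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            long toSkip = from;
            long skipped;
            while (toSkip > 0 && (skipped = br.skip(toSkip)) > 0) {
                toSkip -= skipped;
            }
            int n;
            while (filled < length && (n = br.read(readb, filled, length - filled)) > 0) {
                filled += n;
            }
        }
        return new StringBuilder().append(readb, 0, filled);
    }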
    /* TRANSFORMS AN XSR NODE TO A HASHMAP NODE'S REPRESENTATION */
    public HashMap XSRNode2HashMap(XMLStreamReader xsr) {
        HashMap h = new HashMap();
        ArrayList attributes = new ArrayList();

        for (int i = 0; i < xsr.getAttributeCount(); i++) {
            String[] s = new String[2];
            s[0] = xsr.getAttributeName(i).toString();
            s[1] = xsr.getAttributeValue(i);
            attributes.add(s);
        }

        h.put("tag", xsr.getName());
        h.put("attr", attributes);

        return h;
    }

    public void parse() throws Exception {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        // TRY-WITH-RESOURCES CLOSES THE FILE EVEN IF PARSING FAILS
        try (FileReader f = new FileReader(filePath)) {
            XMLStreamReader xsr  = xif.createXMLStreamReader(f);
            Location previousLoc = xsr.getLocation();

            while (xsr.hasNext()) {
                switch (xsr.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        currentStack.push(XSRNode2HashMap(xsr));
                        for (XMLStackFilter filter : filterHandler.keySet()) {
                            if (filter.isRelevant(currentStack)) {
                                // REMEMBER WHERE THIS RELEVANT NODE STARTS IN THE FILE
                                feedingStreams.put(currentStack.hashCode(), previousLoc.getCharacterOffset());
                            }
                        }
                        previousLoc = xsr.getLocation();
                        break;

                    case XMLStreamConstants.END_ELEMENT:
                        Integer stream = feedingStreams.get(currentStack.hashCode());
                        if (stream != null) {
                            // FIND ALL THE FILTERS RELATED TO THIS FEEDING STREAM AND CALL THEIR HANDLERS
                            for (XMLStackFilter filter : filterHandler.keySet()) {
                                if (filter.isRelevant(currentStack)) {
                                    XMLNodeFoundNotifier h = filterHandler.get(filter);
                                    StringBuilder aChunk = getChunk(stream, xsr.getLocation().getCharacterOffset());
                                    h.nodeFound(aChunk, filter);
                                }
                            }
                            feedingStreams.remove(currentStack.hashCode());
                        }
                        previousLoc = xsr.getLocation();
                        currentStack.pop();
                        break;
                    default:
                        break;
                }
            }
            xsr.close();
        }
    }
}

OK, thanks to your pieces of code, I finally ended up with my own solution.

Usage is quite intuitive:

try {
    /* CREATE THE PARSER */
    XMLParser parser = new XMLParser();
    /* CREATE THE FILTER (THIS IS A REGEX (X)PATH FILTER) */
    XMLRegexFilter filter = new XMLRegexFilter("statements/statement");
    /* CREATE THE HANDLER WHICH WILL BE CALLED WHEN A NODE IS FOUND */
    XMLHandler handler = new XMLHandler() {
        @Override
        public void nodeFound(StringBuilder node, XMLStackFilter withFilter) {
            // DO SOMETHING WITH THE FOUND XML NODE
            System.out.println("Node found");
            System.out.println(node.toString());
        }
    };
    /* ATTACH THE FILTER TO THE HANDLER */
    parser.addFilterWithHandler(filter, handler);
    /* SET THE FILE TO PARSE */
    parser.setFilePath("/path/to/bigfile.xml");
    /* RUN THE PARSER */
    parser.parse();
} catch (Exception ex) {
    ex.printStackTrace();
}
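
Since the handler receives each chunk as text, the chunk can then be loaded into a DOM tree with the JDK's own parser, which is what the question was after. This is only a minimal sketch, assuming each chunk holds exactly one well-formed element with nothing but whitespace around it. The class name ChunkToDom and the method toDom are made up for this illustration and are not part of the parser code in this answer.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/* Hypothetical helper: turns one text chunk into a small DOM document. */
class ChunkToDom {

    static Document toDom(CharSequence chunk) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        // trim() drops any whitespace captured around the element
        String xml = chunk.toString().trim();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Sample chunk, shaped like what nodeFound() might receive
        StringBuilder chunk = new StringBuilder(
                "\n  <statement id=\"1\"><amount>42</amount></statement>");
        Document d = toDom(chunk);
        System.out.println(d.getDocumentElement().getTagName());
        System.out.println(d.getElementsByTagName("amount").item(0).getTextContent());
    }
}
```

Inside nodeFound you would call something like ChunkToDom.toDom(node) and then work on the resulting Document; once the handler returns, the document can be garbage collected.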

Notes:

I wrote the XMLNodeFoundNotifier and XMLStackFilter interfaces so that you can easily plug in or build your own handler / filter.
You should normally be able to parse very large files with this class. Only the returned nodes are actually loaded into memory.
You can enable attribute support by uncommenting the relevant part of the code; I disabled it for simplicity.
You can attach as many filters per handler as you need, and vice versa.
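
One thing to keep in mind when writing a filter expression: XMLRegexFilter calls Matcher.matches(), so the regex has to match the whole path from the root element down to the current node, not just a part of it. The small sketch below shows the effect on a few sample paths. PathMatchDemo and its isRelevant helper are hypothetical names used only for this illustration, and the paths are made-up examples.

```java
import java.util.regex.Pattern;

/* Hypothetical demo of how a regex path filter behaves with matches(). */
class PathMatchDemo {

    // Same test as the regex filter: the WHOLE path string must match
    static boolean isRelevant(String filterRules, String stackPath) {
        return Pattern.compile(filterRules).matcher(stackPath).matches();
    }

    public static void main(String[] args) {
        // Exact path from the root: matches
        System.out.println(isRelevant("statements/statement", "statements/statement"));
        // Extra ancestor in front: does not match
        System.out.println(isRelevant("statements/statement", "root/statements/statement"));
        // A leading wildcard accepts any ancestors: matches
        System.out.println(isRelevant(".*/statements/statement", "root/statements/statement"));
        // A child of the wanted node: does not match
        System.out.println(isRelevant("statements/statement", "statements/statement/amount"));
    }
}
```

So with the filter "statements/statement" from the usage example, the root element of the file has to be statements; for nodes nested deeper, start the expression with something like ".*/".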

Here is all of the code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Stack;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.*;

/* IMPLEMENT THIS IN YOUR CLASS IN ORDER TO BE NOTIFIED WHEN A NODE IS FOUND */
interface XMLNodeFoundNotifier {

    void nodeFound(StringBuilder node, XMLStackFilter withFilter);
}

/* A SMALL HANDLER USEFUL FOR EXPLICIT CLASS DECLARATION */
abstract class XMLHandler implements XMLNodeFoundNotifier {
}

/* INTERFACE TO WRITE YOUR OWN FILTER BASED ON THE CURRENT NODES STACK (PATH)*/
interface XMLStackFilter {

    abstract boolean isRelevant(Stack fullPath);
}

/* A VERY USEFUL FILTER USING A REGEX AS THE PATH FILTER */
class XMLRegexFilter implements XMLStackFilter {

    Pattern relevantExpression;

    XMLRegexFilter(String filterRules) {
        relevantExpression = Pattern.compile(filterRules);
    }

    /* HERE WE ARE ASKED TO TELL WHETHER THE CURRENT STACK (LIST OF NODES) IS RELEVANT
     * OR NOT ACCORDING TO WHAT WE WANT. RETURN TRUE IF THIS IS THE CASE */
    @Override
    public boolean isRelevant(Stack fullPath) {
        /* A POSSIBLE CLEVER WAY COULD BE TO SERIALIZE THE WHOLE PATH (INCLUDING
         * ATTRIBUTES) TO A STRING AND TO MATCH IT WITH A REGEX BEING THE FILTER
         * FOR NOW StackToString DOES NOT SERIALIZE ATTRIBUTES */
        String stackPath = XMLParser.StackToString(fullPath);
        if (stackPath == null) {
            return false; // AN EMPTY STACK HAS NO PATH TO MATCH
        }
        Matcher m = relevantExpression.matcher(stackPath);
        return m.matches();
    }
}

/* THE MAIN PARSER'S CLASS */
public class XMLParser {

    HashMap<XMLStackFilter, XMLNodeFoundNotifier> filterHandler;
    HashMap<Integer, Integer> feedingStreams;
    Stack<HashMap> currentStack;
    String filePath;

    XMLParser() {
        currentStack   = new Stack<>();
        filterHandler  = new HashMap<>();
        feedingStreams = new HashMap<>();
    }

    public void addFilterWithHandler(XMLStackFilter f, XMLNodeFoundNotifier h) {
        filterHandler.put(f, h);
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

    /* CONVERT A STACK OF NODES TO A REGULAR PATH STRING. NOTE THAT BY DEFAULT
     * ATTRIBUTES ARE NOT ADDED TO THE PATH. UNCOMMENT THE LINES BELOW TO
     * DO SO
     */
    public static String StackToString(Stack<HashMap> s) {
        int k = s.size();
        if (k == 0) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        out.append(s.get(0).get("tag"));
        for (int x = 1; x < k; ++x) {
            HashMap node = s.get(x);
            out.append('/').append(node.get("tag"));
            /* 
            // UNCOMMENT THIS TO ADD THE ATTRIBUTES SUPPORT TO THE PATH

            ArrayList <String[]>attributes = (ArrayList)node.get("attr");
            if (attributes.size()>0)
            {
            out.append("[");
            for (int i = 0 ; i<attributes.size(); i++)
            {
            String[]keyValuePair = attributes.get(i);
            if (i>0) out.append(",");
            out.append(keyValuePair[0]);
            out.append("=\"");
            out.append(keyValuePair[1]);
            out.append("\"");
            }
            out.append("]");
            }*/
        }
        return out.toString();
    }

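    /*
     * ONCE A NODE HAS BEEN SUCCESSFULLY FOUND, WE KNOW ITS START AND END OFFSETS
     * IN THE FILE. WE THEN RETRIEVE THE DATA BETWEEN THEM. THIS ASSUMES THE OFFSETS
     * REPORTED BY THE PARSER ARE CHARACTER OFFSETS INTO THE SAME FILE.
     */
    private StringBuilder getChunk(int from, int to) throws Exception {
        int length = to - from;
        char[] readb = new char[length];
        int filled = 0;
        // TRY-WITH-RESOURCES CLOSES THE READER EVEN IF READING FAILS
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            br.skip(from);
            // read() MAY RETURN FEWER CHARACTERS THAN REQUESTED, SO LOOP UNTIL DONE
            int n;
            while (filled < length && (n = br.read(readb, filled, length - filled)) != -1) {
                filled += n;
            }
        }
        return new StringBuilder().append(readb, 0, filled);
    }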
    /* TRANSFORMS AN XSR NODE TO A HASHMAP NODE'S REPRESENTATION */
    public HashMap XSRNode2HashMap(XMLStreamReader xsr) {
        HashMap h = new HashMap();
        ArrayList attributes = new ArrayList();

        for (int i = 0; i < xsr.getAttributeCount(); i++) {
            String[] s = new String[2];
            s[0] = xsr.getAttributeName(i).toString();
            s[1] = xsr.getAttributeValue(i);
            attributes.add(s);
        }

        h.put("tag", xsr.getName());
        h.put("attr", attributes);

        return h;
    }

    public void parse() throws Exception {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        // TRY-WITH-RESOURCES CLOSES THE FILE EVEN IF PARSING FAILS
        try (FileReader f = new FileReader(filePath)) {
            XMLStreamReader xsr  = xif.createXMLStreamReader(f);
            Location previousLoc = xsr.getLocation();

            while (xsr.hasNext()) {
                switch (xsr.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        currentStack.push(XSRNode2HashMap(xsr));
                        for (XMLStackFilter filter : filterHandler.keySet()) {
                            if (filter.isRelevant(currentStack)) {
                                // REMEMBER WHERE THIS RELEVANT NODE STARTS IN THE FILE
                                feedingStreams.put(currentStack.hashCode(), previousLoc.getCharacterOffset());
                            }
                        }
                        previousLoc = xsr.getLocation();
                        break;

                    case XMLStreamConstants.END_ELEMENT:
                        Integer stream = feedingStreams.get(currentStack.hashCode());
                        if (stream != null) {
                            // FIND ALL THE FILTERS RELATED TO THIS FEEDING STREAM AND CALL THEIR HANDLERS
                            for (XMLStackFilter filter : filterHandler.keySet()) {
                                if (filter.isRelevant(currentStack)) {
                                    XMLNodeFoundNotifier h = filterHandler.get(filter);
                                    StringBuilder aChunk = getChunk(stream, xsr.getLocation().getCharacterOffset());
                                    h.nodeFound(aChunk, filter);
                                }
                            }
                            feedingStreams.remove(currentStack.hashCode());
                        }
                        previousLoc = xsr.getLocation();
                        currentStack.pop();
                        break;
                    default:
                        break;
                }
            }
            xsr.close();
        }
    }
}
