Speeding up XPath

Posted on 2024-08-23 21:49:35

I have a document with about 1,000 entries, in a format like this:

<Example>
     <Entry>
          <n1></n1>
          <n2></n2>
      </Entry>
      <Entry>
          <n1></n1>
          <n2></n2>
      </Entry>
      <!--and so on-->

There are more than 1,000 Entry nodes here. I am writing a Java program that fetches the nodes one by one and analyzes each of them. The problem is that the retrieval time grows with the node's position: it takes 78 ms to retrieve the first node, 100 ms to retrieve the second, and it keeps increasing, so that retrieving node 999 takes more than 5 seconds. This is extremely slow. We will be running this code on XML files with far more than 1,000 entries, some with millions. The total time to process the whole document is more than 5 minutes.

I am using this simple code to traverse it. Here nxp is an instance of my own class, which has all the methods for getting nodes from an XPath expression.

nxp.fromXpathToNode("/Example/Entry" + "[" + i  + "]", doc);    

Here doc is the document for the file, and i is the number of the node to retrieve.

Also, when I try something like this:

List<Node> nl = nxp.fromXpathToNodes("/Example/Entry",doc);  
content = nl.get(i);

I face the same problem.

Does anyone have a solution for speeding up the retrieval of the nodes, so that it takes the same amount of time to get the 1st node as the 1000th node from the XML file?

Here is the code for fromXpathToNode.

public Node fromXpathToNode(String expression, Node context)  
{  
    try  
    {  
        return (Node)this.getCachedExpression(expression).evaluate(context, XPathConstants.NODE);  
    }  
    catch (Exception cause)  
    {  
        throw new RuntimeException(cause);  
    }  
}  

And here is the code for fromXpathToNodes.

public List<Node> fromXpathToNodes(String expression, Node context)  
{  
    List<Node> nodes = new ArrayList<Node>();  
    NodeList results = null;  
    
    try  
    {  
        results = (NodeList)this.getCachedExpression(expression).evaluate(context, XPathConstants.NODESET);  
          
        for (int index = 0; index < results.getLength(); index++)  
        {  
            nodes.add(results.item(index));  
        }  
    }  
    catch (Exception cause)  
    {  
        throw new RuntimeException(cause);  
    }  
    
    return nodes;  
}  

And here is the start of the class:

public class NativeXpathEngine implements XpathEngine  
{      
private final XPathFactory factory;  
  
private final XPath engine;  

/**
 * Cache for previously compiled XPath expressions. {@link XPathExpression#hashCode()}
 * is not reliable or consistent so use the textual representation instead.
 */  
private final Map<String, XPathExpression> cachedExpressions;  
  
public NativeXpathEngine()  
{
    super();  
    
    this.factory = XPathFactory.newInstance();  
    this.engine = factory.newXPath();  
    this.cachedExpressions = new HashMap<String, XPathExpression>();  
}  

Comments (6)

一萌ing 2024-08-30 21:49:35

Try VTD-XML. It uses less memory than DOM. It is easier to use than SAX and supports XPath. Here is some sample code to help you get started. It applies an XPath to get the Entry elements and then prints out the n1 and n2 child elements.

final VTDGen vg = new VTDGen();
vg.parseFile("/path/to/file.xml", false);

final VTDNav vn = vg.getNav();
final AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/Example/Entry");
int count = 1;
while (ap.evalXPath() != -1) {
    System.out.println("Inside Entry: " + count);

    //move to n1 child
    vn.toElement(VTDNav.FIRST_CHILD, "n1");
    System.out.println("\tn1: " + vn.toNormalizedString(vn.getText()));

    //move to n2 child
    vn.toElement(VTDNav.NEXT_SIBLING, "n2");
    System.out.println("\tn2: " + vn.toNormalizedString(vn.getText()));

    //move back to parent
    vn.toElement(VTDNav.PARENT);
    count++;
}

拔了角的鹿 2024-08-30 21:49:35

The correct solution is to detach the node right after you call item(i), like so:

Node node = results.item(index);
node.getParentNode().removeChild(node);
nodes.add(node);

See "XPath.evaluate performance slows down (absurdly) over multiple calls".
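
To make the idea above concrete, here is a minimal, self-contained sketch using only the JDK's built-in DOM and XPath classes. The class name DetachedEntries and the method readN1Values are made up for illustration and are not part of the asker's NativeXpathEngine. It evaluates /Example/Entry once, then detaches each Entry before running a relative query on it. Note that this modifies the parsed document, so it only fits when the tree is not needed afterwards, and the actual speedup depends on the XPath implementation in use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class DetachedEntries {

    /** Returns the text of the n1 child of every /Example/Entry, in document order. */
    static List<String> readN1Values(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Compile each expression once and select all entries in a single evaluation,
            // instead of evaluating "/Example/Entry[i]" once per index.
            XPathExpression entries = xpath.compile("/Example/Entry");
            XPathExpression n1 = xpath.compile("n1");

            NodeList results = (NodeList) entries.evaluate(doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < results.getLength(); i++) {
                Node entry = results.item(i);
                // Detach the entry so the relative query below only has to
                // deal with this small subtree rather than the whole document.
                entry.getParentNode().removeChild(entry);
                values.add((String) n1.evaluate(entry, XPathConstants.STRING));
            }
            return values;
        } catch (Exception cause) {
            throw new RuntimeException(cause);
        }
    }
}
```

Calling DetachedEntries.readN1Values with the XML text as a string returns one value per Entry. Cloning each entry with cloneNode(true) instead of detaching it is a possible alternative when the original tree must stay intact.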

浮光之海 2024-08-30 21:49:35

I had a similar issue with XPath evaluation. I tried CachedXPathAPI, which was about 100 times faster than the XPathAPI I had been using before.
More information about this API is available here:
http://xml.apache.org/xalan-j/apidocs/org/apache/xpath/CachedXPathAPI.html

Hope it helps.
Cheers,
Madhusudhan

咿呀咿呀哟 2024-08-30 21:49:35

If you need to parse huge but flat documents, SAX is a good alternative. It allows you to handle the XML as a stream instead of building a huge DOM. Your example could be parsed using a ContentHandler like this:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;

public class ExampleHandler extends DefaultHandler2 {

    private StringBuffer chars = new StringBuffer(1000);

    private MyEntry currentEntry;
    private MyEntryHandler myEntryHandler;

    ExampleHandler(MyEntryHandler myEntryHandler) {
        this.myEntryHandler = myEntryHandler;
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // Append only the reported slice; the parser may reuse the rest of the array.
        chars.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ("Entry".equals(localName)) {
            myEntryHandler.handle(currentEntry);
            currentEntry = null;
        }
        else if ("n1".equals(localName)) {
            currentEntry.setN1(chars.toString());
        }
        else if ("n2".equals(localName)) {
            currentEntry.setN2(chars.toString());
        }
    }


    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        chars.setLength(0);
        if ("Entry".equals(localName)) {
            currentEntry = new MyEntry();
        }
    }
}

If the document has a deeper and more complex structure, you will need a Stack to keep track of the current path in the document. At that point you should consider writing a general-purpose ContentHandler that does the dirty work and delegates to handlers specific to your document type.
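
As a rough illustration of the stack idea, the sketch below keeps the names of the open elements in a Deque and only collects text when the current path is /Example/Entry/n1. The class name PathTrackingHandler and the helper collectN1 are invented for this example. It compares qName rather than localName because the default SAXParserFactory is not namespace-aware, in which case localName may be empty.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
final class PathTrackingHandler extends DefaultHandler {

    // Names of the currently open elements, root first.
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private final List<String> n1Values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only react to n1 elements that sit directly under /Example/Entry.
        if ("/Example/Entry/n1".equals(currentPath())) {
            n1Values.add(text.toString());
        }
        path.removeLast();
    }

    private String currentPath() {
        return "/" + String.join("/", path);
    }

    /** Parses the XML text and returns the text of every /Example/Entry/n1 element. */
    static List<String> collectN1(String xml) {
        try {
            PathTrackingHandler handler = new PathTrackingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.n1Values;
        } catch (Exception cause) {
            throw new RuntimeException(cause);
        }
    }
}
```

Because the path check is a plain string comparison, an n1 element elsewhere in the document (for example under a different parent) is ignored. A general-purpose handler could map such path strings to callbacks instead of hard-coding one path.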

何以畏孤独 2024-08-30 21:49:35

What kind of parser are you using?

DOM pulls the whole document into memory. Once it is loaded, operations on it can be fast, but loading it inside a web application or inside a for loop can have a noticeable impact.

A SAX parser streams the document and reports each element to your handler as it is read, without building the whole tree in memory.

So try to use a parser implementation that suits your need.
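
If the document is already loaded as a DOM, one option that avoids XPath altogether is to walk the children of the root element through their sibling links, so that each step to the next Entry is a single pointer hop. The following is only a sketch under that assumption. The class name DomWalk and the method readN2Values are hypothetical, and the element names are taken from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class DomWalk {

    /** Returns the text of the n2 child of every Entry directly under the root element. */
    static List<String> readN2Values(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            List<String> values = new ArrayList<>();
            // Follow sibling links instead of asking for "the i-th Entry" each time.
            for (Node child = doc.getDocumentElement().getFirstChild();
                    child != null; child = child.getNextSibling()) {
                if (!isElement(child, "Entry")) {
                    continue; // skip whitespace text nodes, comments, other elements
                }
                for (Node field = child.getFirstChild();
                        field != null; field = field.getNextSibling()) {
                    if (isElement(field, "n2")) {
                        values.add(field.getTextContent());
                    }
                }
            }
            return values;
        } catch (Exception cause) {
            throw new RuntimeException(cause);
        }
    }

    private static boolean isElement(Node node, String name) {
        return node.getNodeType() == Node.ELEMENT_NODE && name.equals(node.getNodeName());
    }
}
```

This still holds the whole document in memory, so for files with millions of entries a streaming parser is likely the better fit.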

我很OK 2024-08-30 21:49:35

Use the Jaxen library for XPath:
http://jaxen.codehaus.org/
