Too long XPath with DOMXPath query/evaluate returns nothing

Posted 2024-09-30 11:24:16

I am using PHP to retrieve content for a given URL and XPath.
I use DOMDocument / DOMXPath (with query or evaluate).

For short XPath expressions I get the correct result, but longer ones return nothing, even though they appear to be correct: I obtained them with XPather (a Firefox plugin) and re-tested them with YQL.

Do you have any advice on this curious problem?

Example of code:

$doc = new DOMDocument();
$myXMLString = file_get_contents('http://stackoverflow.com/questions/4097230/too-long-xpath-with-domxpath-query-evaluate-return-nothing');
@$doc->loadHTML($myXMLString); // @ suppresses the warnings raised by
                               // unclosed or invalid markup
$xpath = new DOMXPath($doc);

$fullPath = "/html/body/small/path"; // works
//$fullPath = "/html/body/full/path/with/lot/of/markup"; // does not work
$entries = $xpath->query($fullPath);
// or ->evaluate($fullPath) (same behaviour)
// $entries is a DOMNodeList (empty for a long path query,
//                            correct for a short path query)

I also tested with attribute predicates, but nothing changes: short XPath expressions work, longer ones do not.

Example, for this current page:

$fullPath = "/html
              /body
               /div[4]
                /div[@id='content']
                 /div[@id='question-header']
                  /h1
                   /a"; // works (retrieves the question title)
$fullPath = "/html
              /body
               /div[4]
                /div[@id='content']
                 /div[@id='mainbar']
                  /div[@id='question']
                   /table
                    /tbody
                     /tr[2]
                      /td[2]
                       /div[@id='comments-4097230']
                        /table
                         /tbody
                          /tr[@id='comment-4408626']
                           /td[2]
                            /div
                             /a"; // does not work
                                  // (should retrieve 'gaby' from the comment)

Edit:

I tested with the SimpleXML library and I see exactly the same behavior (correct results for short queries, nothing for long queries).

Edit 2:

I also shortened the longest XPath by deleting some of its leading elements, and then it works.
BTW, I really do not understand why a complete, correct XPath does not work.

东京女 2024-10-07 11:24:16

Let's go through this step by step:

Step 1: replicating the error.

After verifying that the XPath does indeed return no result, I wrote a little script to see how deep the XPath goes before it breaks:

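$xp = new DOMXPath($doc); // $doc is the DOMDocument loaded in the question
$xpath = '';              // prefix of $fullPath built up step by step
foreach (explode('/', $fullPath) as $segment) {
    $xpath .= trim($segment);
    if ($xpath !== '') { // skip the empty piece before the leading slash
        echo '-------------------------------------------', PHP_EOL,
             'Trying: ', $xpath, PHP_EOL,
             '-------------------------------------------', PHP_EOL;
        // string() prints the text of the first match, or nothing at all
        echo $xp->evaluate("string($xpath)"), PHP_EOL;
    }
    $xpath .= '/';
}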

The last path it returns a result for is:

/html/body/div[4]/div[@id='content']/div[@id='mainbar']/div[@id='question']/table
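
The same probing idea can be packaged as a small helper that reports the longest prefix still matching a node. This is only a minimal sketch: the function name longestMatchingPrefix and the inline sample markup are assumptions made for illustration, not part of the original post, and the naive split on '/' assumes that no predicate contains a slash.

```php
<?php
// Hypothetical helper (not from the original post): returns the longest
// prefix of an absolute XPath that still matches at least one node.
function longestMatchingPrefix(DOMXPath $xp, string $fullPath): string
{
    $prefix = '';
    $lastMatch = '';
    // Naive split: assumes no predicate contains a '/' character.
    foreach (explode('/', trim($fullPath)) as $segment) {
        $segment = trim($segment);
        if ($segment === '') {
            continue; // empty piece before the leading slash
        }
        $prefix .= '/' . $segment;
        if ($xp->query($prefix)->length === 0) {
            break; // this step is the first one that matches nothing
        }
        $lastMatch = $prefix;
    }
    return $lastMatch;
}

// Inline sample markup standing in for the fetched page (an assumption).
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="question"><table><tr><td>x</td></tr></table></div></body></html>');
$xp = new DOMXPath($doc);

$result = longestMatchingPrefix($xp, "/html/body/div[@id='question']/table/tbody/tr/td");
echo $result, PHP_EOL;
```

Counting matched nodes with query() instead of printing string() keeps the output short and also catches elements that exist but contain no text.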

Step 2: checking the markup

So I checked the markup returned by DOMDocument::saveHTML() to see what it looks like and there was no <tbody> (reformatted for readability):

<div id="question">
    <div class="everyonelovesstackoverflow" id="adzerk1"></div>
        <table>
            <tr><td class="votecell">

I then checked this very page to see if it was DOM throwing it away or if it really does not exist. It wasn't there. Apparently, Firebug inserts it, which would explain why you got the result with XPather (but not why you got it with YQL):

[Screenshot: page source next to the apparently bugged Firebug view]

Step 3: verification and conclusion

I removed the <tbody> from the XPath and reran the script. No problems. Returns "Gaby".

While I first suspected a bug in Firebug, Alejandro commented that this happens in IE's DeveloperTools, too. I then suspected the element was added by JavaScript, but could not verify that. After some more research, Alejandro pointed me to the question "Why does firebug add <tbody> to <table>?": it is actually neither Firebug nor JavaScript, but the browsers themselves.

So to modify my conclusion:

Don't trust the markup you see rendered in the browser, because it may have been modified by the browser or by other technologies. DOM only gets what is served directly. If you run into similar issues again, you now know how to approach them.
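
The effect can be reproduced on a tiny document. The following is a minimal sketch, assuming inline sample markup in place of the real page: the table is served without a <tbody>, so a path copied from a browser inspector matches nothing, while a path that follows the served markup does.

```php
<?php
// Inline sample markup (an assumption standing in for the served page source):
// the table has no <tbody>, just like the raw HTML in the question.
$html = '<html><body><table><tr><td>Gaby</td></tr></table></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Path copied from a browser inspector: contains the tbody the browser added.
$fromBrowser = $xp->query('/html/body/table/tbody/tr/td');
// Path matching the markup as served.
$asServed = $xp->query('/html/body/table/tr/td');
// Tolerant path: matches rows whether or not a tbody sits in between.
$tolerant = $xp->query('/html/body/table//tr/td');

echo $fromBrowser->length, PHP_EOL;              // 0
echo $asServed->length, PHP_EOL;                 // 1
echo $tolerant->item(0)->textContent, PHP_EOL;   // Gaby
```

The tolerant form with // is convenient when the same path has to work on both kinds of markup, but note that it also matches rows of any nested tables.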

Some additional sidenotes

Unless you need to modify the markup before feeding it to DOM, you do not have to use file_get_contents() to load the content. You can use DOM's loadHTMLFile():

$dom->loadHTMLFile('http://www.example.com/foo.htm');

Also, the proper way to suppress errors is to tell libxml to use its internal error handler. Instead of handling the errors, you then simply clear them. This only affects errors relating to libxml, e.g. parsing errors (as opposed to all PHP errors):

libxml_use_internal_errors(TRUE);
libxml_clear_errors();
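
If the parser messages are of interest, they can be read with libxml_get_errors() before they are cleared. This is a minimal sketch, assuming deliberately sloppy inline markup that is not taken from the original page; which messages appear, if any, depends on the libxml version in use.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Deliberately sloppy sample markup (hypothetical, not from the original page).
$doc->loadHTML('<html><body><p>unclosed <b>markup</body></html>');

// Inspect what the parser complained about before discarding it.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    // Each entry is a LibXMLError with message, line, column, level and code.
    echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
}

libxml_clear_errors();                  // empty the error buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

// The document is still usable despite the broken markup.
echo $doc->getElementsByTagName('p')->length, PHP_EOL;
```

Restoring the previous setting keeps the change local, so other code that relies on the default warning behaviour is not affected.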

Finally, XPath queries can be evaluated relative to a context node. So while the long XPath is efficient in terms of lookup time, you could simply use getElementById() to get the deepest known node and then run a short relative XPath against it.

In other words:

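libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com/foo.htm');
libxml_clear_errors();
$xp = new DOMXPath($dom); // XPath evaluator for the loaded document
echo $xp->evaluate(
    'string(td[2]/div/a)',                    // path relative to the context node
    $dom->getElementById('comment-4408626')); // context node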

will return "Gaby" as well.
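
For a version that runs without fetching a remote page, here is a minimal sketch on inline markup. The sample HTML is an assumption that only mirrors the id used in the question, and the XPath fallback for the id lookup is an extra precaution that is not part of the original answer.

```php
<?php
// Inline sample markup (an assumption; the id mirrors the one in the question).
$html = '<html><body><table>'
      . '<tr id="comment-4408626"><td>1</td>'
      . '<td><div><a href="#">Gaby</a></div></td></tr>'
      . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Anchor on the element with the known id.
$context = $dom->getElementById('comment-4408626');
if ($context === null) {
    // Fallback in case the id lookup fails: find the element via XPath instead.
    $context = $xp->query("//*[@id='comment-4408626']")->item(0);
}

// A relative path is evaluated against the context node.
$name = $xp->evaluate('string(td[2]/div/a)', $context);
echo $name, PHP_EOL; // Gaby
```

Anchoring on an id makes the query independent of everything above that element, including any <tbody> the browser may have shown.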
