How can this XPath query (PHP) be made more flexible?

Posted on 2024-08-23 04:59:32

I'm parsing an XHTML document using PHP's SimpleXML. I need to query a series of ul's in the document for a node containing a specific value, then find that node's parent's direct previous sibling... code will help explain!

Given the following dummy xhtml:

<html>
<head></head>
<body>
...

<ul class="attr-list"> 
    <li>Active Life (active)</li> 
    <ul> 
        <li>Amateur Sports Teams (amateursportsteams)</li> 
        <li>Amusement Parks (amusementparks)</li> 
        <li>Fitness & Instruction (fitness)</li> 
        <ul> 
            <li>Dance Studios (dancestudio)</li> 
            <li>Gyms (gyms)</li> 
            <li>Martial Arts (martialarts)</li> 
            <li>Pilates (pilates)</li> 
            <li>Swimming Lessons/Schools (swimminglessons)</li>  
        </ul> 
        <li>Go Karts (gokarts)</li> 
        <li>Mini Golf (mini_golf)</li> 
        <li>Parks (parks)</li> 
        <ul> 
            <li>Dog Parks (dog_parks)</li> 
            <li>Skate Parks (skate_parks)</li> 
        </ul> 
        <li>Playgrounds (playgrounds)</li> 
        <li>Rafting/Kayaking (rafting)</li> 
        <li>Tennis (tennis)</li> 
        <li>Zoos (zoos)</li> 
    </ul> 
    <li>Arts & Entertainment (arts)</li> 
    <ul> 
        <li>Arcades (arcades)</li> 
        <li>Art Galleries (galleries)</li> 
        <li>Wineries (wineries)</li> 
    </ul> 
    <li>Automotive (auto)</li> 
    <ul> 
        <li>Auto Detailing (auto_detailing)</li> 
        <li>Auto Glass Services (autoglass)</li> 
        <li>Auto Parts & Supplies (autopartssupplies)</li> 
    </ul>
    <li>Nightlife (nightlife)</li>
    <ul>
        <li>Bars (bars)</li>
        <ul>
            <li>Dive Bars (divebars)</li>
        </ul>
    </ul>
</ul>

...
</body>
</html>

I need to be able to query the ul.attr-list for a child element, and discover its "root" category. I cannot change the xhtml to be formed differently.

So, if I have "galleries" as a category, I need to know that it is in the "arts" "root" category. Or, if I have "dog_parks", I need to know that it is in the "active" category. The following code gets the job done, but only with the assumption that at max there are two nested levels:

function get_root_category($shortCategoryName){

    $url = "http://www.yelp.com/developers/documentation/category_list";
    $result = file_get_contents($url);

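    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false; // must be set before loading to have any effect
    @$dom->loadHTML($result);         // @ suppresses the warnings caused by the invalid markup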

    $sxml = simplexml_import_dom($dom);

    $lvl1 = $sxml->xpath("//li[contains(., '".$shortCategoryName."')]/parent::ul/preceding-sibling::li");
    $lvl2 = $sxml->xpath("//li[contains(., '".$shortCategoryName."')]/parent::ul/preceding-sibling::li/parent::ul/preceding-sibling::li");

    if($lvl2){
        return array_pop($lvl2);
    } else {
        return array_pop($lvl1);
    }
}

There has to be a better way to write that XPath, so that only one query is needed and it holds up across multiple nested levels.

EDIT: Thanks to those who pointed out that this HTML is not valid. However, the structure of the page is set and I cannot edit it; I can only use it as a resource and have to make do with what it is.

Comments (2)

花桑 2024-08-30 04:59:32

I need to query a series of ul's in
the document for a node containing a
specific value, then find that node's
parent's direct previous sibling...

That would be (here $v is the value you are looking for):

$p = "/html/body//ul[li[contains(text(), '$v')]]/preceding-sibling::li[1]";
  • Make sure that you check that $v does not contain single quotes, since this would break the XPath expression.
  • When you want to look for whole words only, use:
    [contains(concat(' ', text(), ' '), concat(' ', '$v', ' '))].
  • When you want to match case-insensitively, use (I abbreviated the full alphabet with …):
    [contains(translate(text(), 'ABC…XYZ', 'abc…xyz'), '{strtolower($v)}')].
  • Note that predicates can be nested.
  • Note that the use of text() ensures only direct child text nodes are taken into account. When you use . instead, the whole "subtree" of the <li> is converted to string and you might get more results than you actually want.
  • Note that I restricted the // operator (shorthand for /descendant-or-self::node()/) to a certain part of the tree. If you can restrict it further, by all means do so.
    Letting your XPath start with // makes it much slower than it needs to be, since every node of the entire document is checked, even those that cannot produce a match under any circumstances.
  • As others have already noted, the HTML is invalid.
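
The quoting and case-insensitivity notes above can be sketched in PHP roughly as follows. This is a minimal illustration rather than part of the answer: the helper names xpath_literal and find_parent_category are hypothetical, the inline sample is a trimmed, well-formed copy of the list from the question, and the search starts at // only because the sample has no html/body wrapper. PHP does not interpolate function calls inside double-quoted strings, so the lower-casing is done before the expression is built.

```php
<?php
// Hypothetical helper: turn an arbitrary string into an XPath 1.0 string literal.
// XPath 1.0 has no escape character, so a value containing both quote types
// has to be assembled with concat().
function xpath_literal(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Split on single quotes and re-insert each one as a double-quoted piece.
    $parts = array_map(
        fn($part) => "'" . $part . "'",
        explode("'", $value)
    );
    return 'concat(' . implode(', "\'", ', $parts) . ')';
}

// Hypothetical helper: case-insensitive lookup of the <li> that directly
// precedes the <ul> holding the matching item (the immediate parent category).
function find_parent_category(DOMXPath $xpath, string $v): ?string
{
    $upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $lower = 'abcdefghijklmnopqrstuvwxyz';
    $needle = xpath_literal(strtolower($v));
    $query = "//ul[li[contains(translate(text(), '$upper', '$lower'), $needle)]]"
           . "/preceding-sibling::li[1]";
    $nodes = $xpath->query($query);
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

// Trimmed sample of the list from the question (well-formed XML).
$xml = <<<XML
<ul class="attr-list">
    <li>Active Life (active)</li>
    <ul>
        <li>Parks (parks)</li>
        <ul>
            <li>Dog Parks (dog_parks)</li>
        </ul>
    </ul>
    <li>Arts (arts)</li>
    <ul>
        <li>Art Galleries (galleries)</li>
    </ul>
</ul>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

echo find_parent_category($xpath, 'DOG_PARKS'), "\n"; // Parks (parks)
```

Note that this returns the immediate parent category, as the expression in the answer does, not the top-level one.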
つ低調成傷 2024-08-30 04:59:32

How about:

/html/body/ul/ul[count(descendant::li[contains(.,'dog_parks')]) > 0]/preceding-sibling::li[1]

This should work with deeply nested lists. It always gets the upper-most category.

By the way: I don't think nesting ul's like this is valid.
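
As a rough sketch of how this single-query idea might be wrapped in PHP: the function name get_root_category_from_dom is hypothetical, the list is located by its attr-list class instead of a fixed /html/body path, and the match is made on the short name in parentheses (an assumption based on the sample markup) so that "parks" does not also match "dog_parks". The sample is a trimmed, well-formed copy of the list from the question.

```php
<?php
// Sketch: resolve the top-level ("root") category for a short category name.
// Assumes the list is the <ul class="attr-list"> from the question and that
// short names appear in parentheses, e.g. "(dog_parks)".
function get_root_category_from_dom(DOMDocument $dom, string $short): ?string
{
    // Reject anything that could break out of the quoted XPath string.
    if (!preg_match('/^[A-Za-z0-9_]+$/', $short)) {
        return null;
    }
    $xpath = new DOMXPath($dom);
    // Pick the second-level <ul> that contains the match at any depth,
    // then step back to the <li> immediately before it.
    $query = "//ul[@class='attr-list']/ul[descendant::li[contains(., '($short)')]]"
           . "/preceding-sibling::li[1]";
    $nodes = $xpath->query($query);
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

// Trimmed sample of the list from the question (well-formed XML).
$xml = <<<XML
<ul class="attr-list">
    <li>Active Life (active)</li>
    <ul>
        <li>Parks (parks)</li>
        <ul>
            <li>Dog Parks (dog_parks)</li>
        </ul>
    </ul>
    <li>Arts &amp; Entertainment (arts)</li>
    <ul>
        <li>Art Galleries (galleries)</li>
    </ul>
    <li>Nightlife (nightlife)</li>
    <ul>
        <li>Bars (bars)</li>
        <ul>
            <li>Dive Bars (divebars)</li>
        </ul>
    </ul>
</ul>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

echo get_root_category_from_dom($dom, 'dog_parks'), "\n"; // Active Life (active)
echo get_root_category_from_dom($dom, 'divebars'), "\n";  // Nightlife (nightlife)
```

For the live page, the DOMDocument would presumably be filled with loadHTML() as in the question. A name that is itself a top-level category (such as "active") yields null here, because its <li> is not inside a second-level <ul>.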
