How can this xpath query (PHP) be made more flexible?
I'm parsing an XHTML document using PHP's SimpleXML. I need to query a series of ul's in the document for a node containing a specific value, then find that node's parent's direct previous sibling... code will help explain!
Given the following dummy xhtml:
<html>
<head></head>
<body>
...
<ul class="attr-list">
  <li>Active Life (active)</li>
  <ul>
    <li>Amateur Sports Teams (amateursportsteams)</li>
    <li>Amusement Parks (amusementparks)</li>
    <li>Fitness & Instruction (fitness)</li>
    <ul>
      <li>Dance Studios (dancestudio)</li>
      <li>Gyms (gyms)</li>
      <li>Martial Arts (martialarts)</li>
      <li>Pilates (pilates)</li>
      <li>Swimming Lessons/Schools (swimminglessons)</li>
    </ul>
    <li>Go Karts (gokarts)</li>
    <li>Mini Golf (mini_golf)</li>
    <li>Parks (parks)</li>
    <ul>
      <li>Dog Parks (dog_parks)</li>
      <li>Skate Parks (skate_parks)</li>
    </ul>
    <li>Playgrounds (playgrounds)</li>
    <li>Rafting/Kayaking (rafting)</li>
    <li>Tennis (tennis)</li>
    <li>Zoos (zoos)</li>
  </ul>
  <li>Arts & Entertainment (arts)</li>
  <ul>
    <li>Arcades (arcades)</li>
    <li>Art Galleries (galleries)</li>
    <li>Wineries (wineries)</li>
  </ul>
  <li>Automotive (auto)</li>
  <ul>
    <li>Auto Detailing (auto_detailing)</li>
    <li>Auto Glass Services (autoglass)</li>
    <li>Auto Parts & Supplies (autopartssupplies)</li>
  </ul>
  <li>Nightlife (nightlife)</li>
  <ul>
    <li>Bars (bars)</li>
    <ul>
      <li>Dive Bars (divebars)</li>
    </ul>
  </ul>
</ul>
...
</body>
</html>
I need to be able to query the ul.attr-list for a child element, and discover its "root" category. I cannot change the xhtml to be formed differently.
So, if I have "galleries" as a category, I need to know that it is in the "arts" "root" category. Or, if I have "dog_parks", I need to know that it is in the "active" category. The following code gets the job done, but only with the assumption that at max there are two nested levels:
function get_root_category($shortCategoryName){
    $url = "http://www.yelp.com/developers/documentation/category_list";
    $result = file_get_contents($url);
    $dom = new DOMDocument();
    // The page is not valid HTML, so parser warnings are suppressed with @.
    @$dom->loadHTML($result);
    $dom->preserveWhiteSpace = false;
    $sxml = simplexml_import_dom($dom);
    // All <li> siblings preceding the matching item's parent <ul> (one level up).
    $lvl1 = $sxml->xpath("//li[contains(., '".$shortCategoryName."')]/parent::ul/preceding-sibling::li");
    // The same step repeated once more (two levels up).
    $lvl2 = $sxml->xpath("//li[contains(., '".$shortCategoryName."')]/parent::ul/preceding-sibling::li/parent::ul/preceding-sibling::li");
    if($lvl2){
        return array_pop($lvl2);
    } else {
        return array_pop($lvl1);
    }
}
There has to be a better way to write that XPath, so that only one query needs to be made, and is relatively bulletproof to multiple, nested levels.
EDIT: Thanks to those who pointed out that this HTML is not valid. However, the structure of the page is set, and I cannot edit it; I can only use it as a resource, and have to make do with what it is.
Comments (2)
That would be (here $v is the value you look for):
[the XPath expression was not preserved in this copy]
Make sure $v does not contain single quotes, since this would break the XPath expression.
To match $v only as a whole, space-delimited word, use a predicate like:
[contains(concat(' ', text(), ' '), concat(' ', '$v', ' '))]
To match case-insensitively (the … abbreviates the full alphabet):
[contains(translate(text(), 'ABC…XYZ', 'abc…xyz'), '{strtolower($v)}')]
Using text() ensures only direct child text nodes are taken into account. When you use "." instead, the whole "subtree" of the <li> is converted to string and you might get more results than you actually want.
Restrict the "//" operator (a shortcut for the descendant axis) to a certain part of the tree. If you can restrict it further, by all means do so. Letting your XPath start with "//" makes it much slower than it needs to be, since all nodes of the entire document are checked, even those that cannot under any circumstances produce a match.
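The hints above can be combined into a single query. The following is only a sketch under stated assumptions, not the answerer's original expression: the sample is a trimmed copy of the list from the question, loaded with loadXML (with & escaped) so it parses deterministically; xpath_literal() and find_root_category() are hypothetical helper names; and the short name is assumed to always appear in parentheses, as it does in the sample.

```php
<?php
// Hypothetical sample: a trimmed, well-formed copy of the list from the question.
$xml = <<<XML
<ul class="attr-list">
  <li>Active Life (active)</li>
  <ul>
    <li>Parks (parks)</li>
    <ul>
      <li>Dog Parks (dog_parks)</li>
    </ul>
  </ul>
  <li>Arts &amp; Entertainment (arts)</li>
  <ul>
    <li>Art Galleries (galleries)</li>
  </ul>
</ul>
XML;

// Hypothetical helper: quote an arbitrary string as an XPath 1.0 literal,
// so that quotes inside the value cannot break the expression.
function xpath_literal(string $s): string
{
    if (strpos($s, "'") === false) {
        return "'" . $s . "'";
    }
    if (strpos($s, '"') === false) {
        return '"' . $s . '"';
    }
    // Both quote types present: split on single quotes and glue with concat().
    return "concat('" . str_replace("'", "', \"'\", '", $s) . "')";
}

// Hypothetical helper: return the text of the top-level category for $v,
// or null when $v is not found below a top-level category.
function find_root_category(DOMXPath $xpath, string $v): ?string
{
    // Assumption: short names are wrapped in parentheses, e.g. "(dog_parks)".
    $needle = xpath_literal('(' . $v . ')');
    $query = "//ul[@class='attr-list']//li[contains(text(), $needle)]"
           . "/ancestor::ul[parent::ul[@class='attr-list']]"
           . "/preceding-sibling::li[1]";
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

echo find_root_category($xpath, 'dog_parks'), "\n"; // Active Life (active)
echo find_root_category($xpath, 'galleries'), "\n"; // Arts & Entertainment (arts)
```

The ancestor step climbs from the matching item to the one ul that sits directly under ul.attr-list, however deep the item is nested, and preceding-sibling::li[1] then picks the nearest li before that ul. A top-level category such as "active" has no such ancestor, so this sketch returns null for it.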
How about:
/html/body/ul/ul[count(descendant::li[contains(.,'dog_parks')]) > 0]/preceding-sibling::li
This should work with deeply nested lists. It always gets the upper-most category.
By the way: I don't think nesting ul elements like this is valid.
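One detail worth checking with this expression: preceding-sibling::li without a position selects every li before the matching ul, not only the nearest one. A small sketch with SimpleXML, assuming a trimmed, well-formed copy of the sample (the variable names are illustrative only), shows the difference when [1] is added:

```php
<?php
// Hypothetical sample: trimmed, well-formed copy of the list from the question.
$xml = <<<XML
<html>
<body>
<ul class="attr-list">
  <li>Active Life (active)</li>
  <ul>
    <li>Parks (parks)</li>
    <ul>
      <li>Dog Parks (dog_parks)</li>
    </ul>
  </ul>
  <li>Arts &amp; Entertainment (arts)</li>
  <ul>
    <li>Art Galleries (galleries)</li>
  </ul>
</ul>
</body>
</html>
XML;

$sxml = simplexml_load_string($xml);

// Without a position: every <li> that precedes the matching top-level <ul>.
$all = $sxml->xpath(
    "/html/body/ul/ul[count(descendant::li[contains(., 'galleries')]) > 0]"
    . "/preceding-sibling::li"
);

// With [1]: only the nearest preceding <li>, i.e. the category heading itself.
$nearest = $sxml->xpath(
    "/html/body/ul/ul[descendant::li[contains(., 'galleries')]]"
    . "/preceding-sibling::li[1]"
);

echo count($all), "\n";                       // 2
echo trim((string) $nearest[0]), "\n";        // Arts & Entertainment (arts)
```

The predicate [descendant::li[...]] is equivalent to the count(...) > 0 form, since a non-empty node set is already true in XPath 1.0.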