
Possible reasons why SeleniumRC CSS locators are slower than XPath?

Posted 2024-11-01 09:55:29


I've got some code that does a simulated recursive tree walk to scrape content from an HTML tree using SeleniumRC. I've run the code using both XPath and CSS locators.

The tree is represented as a series of nested tables. If it matters at all, some of the tree content starts out not visible because branches are "collapsed". For both XPath and CSS, the tree is in the same state in terms of visible vs. not visible.

To get node values, my code starts with a "root" expression, adds "branch" tokens that can be incremented for each successive sibling node, and then uses a "node" token to get the text content.

It all works, but it runs much slower with the CSS expressions I've come up with.

I suppose it is a kludgy way to make locator expressions, although it works for my purposes. I'm just trying to figure out how best to use CSS to get closer to the times I see with XPath.

The loop tests many invalid expressions (it keeps looking for the nth sibling until none is found), and the expressions get really long because of the way I drill incrementally further and further into the nested tables.

Below are the pieces of the expressions and some examples that come out of the recursion. If anyone can provide some insight into what I am doing that makes CSS take so much longer than XPath, that would be very helpful.

I am a total newb at this kind of manipulation of HTML content. If you see something dumb in how I've moved from XPath to CSS, please say so.

XPath “tokens”:

final String rootbase = "//*[contains(@id,\"treeBox\")]/div";
// in next string, "{branchIncrement}" will be replaced with integer values from 2 to get to text content, and skip graphical elements
final String leveltoken = "/table/tbody/tr[{branchIncrement}]/td[2]";
final String nodetoken = "/table/tbody/tr/td[4]/span";

CSS “tokens”:

final String rootbase = "css=[id*=treeBox]>div";
// in next string, "{branchIncrement}" will be replaced with integer values from 2 to get to text content, and skip graphical elements
final String leveltoken = ">table>tbody>tr:nth-child({branchIncrement})>td:nth-child(2)";
final String nodetoken = ">table>tbody>tr>td:nth-child(4)>span";
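
For readers following along, here is a minimal sketch of how tokens like these might be assembled into a full locator. The poster's actual code is not shown, so the class name `LocatorBuilder` and the `build` helper are hypothetical; only the token strings are taken from the post.

```java
class LocatorBuilder {
    // Token strings copied from the post.
    static final String XPATH_ROOT = "//*[contains(@id,\"treeBox\")]/div";
    static final String XPATH_LEVEL = "/table/tbody/tr[{branchIncrement}]/td[2]";
    static final String XPATH_NODE = "/table/tbody/tr/td[4]/span";

    static final String CSS_ROOT = "css=[id*=treeBox]>div";
    static final String CSS_LEVEL = ">table>tbody>tr:nth-child({branchIncrement})>td:nth-child(2)";
    static final String CSS_NODE = ">table>tbody>tr>td:nth-child(4)>span";

    // Hypothetical helper: appends one level token per branch index,
    // then the node token that selects the text-bearing span.
    static String build(String root, String level, String node, int... branches) {
        StringBuilder locator = new StringBuilder(root);
        for (int branch : branches) {
            locator.append(level.replace("{branchIncrement}", Integer.toString(branch)));
        }
        return locator.append(node).toString();
    }

    public static void main(String[] args) {
        // One level deep: the "root" content of the tree.
        System.out.println(build(XPATH_ROOT, XPATH_LEVEL, XPATH_NODE, 2));
        // Five levels deep: every extra level adds a whole token to the locator.
        System.out.println(build(CSS_ROOT, CSS_LEVEL, CSS_NODE, 2, 2, 3, 2, 2));
    }
}
```

Each additional level appends a complete level token, so the locator length grows linearly with depth, and the whole string is re-evaluated from the root on every lookup.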

The first XPath expression for the content at the "root" is:

//*[contains(@id,"treeBox")]/div/table/tbody/tr[2]/td[2]/table/tbody/tr/td[4]/span

The last XPath expression for a 40-node tree with four levels and three siblings at each level below the root (1+3+3x3+3x3x3) is:

//*[contains(@id,"treeBox")]/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[3]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr/td[4]/span

The first CSS expression is:

[id*=treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span

The last CSS expression is:

[id*=treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(3)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span
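
The "keep looking for the nth sibling until not found" loop described above could be sketched roughly as follows. This is a simplified model, not the poster's code: the `Predicate<String>` stands in for a presence check such as Selenium RC's `isElementPresent`, the class `TreeWalkSketch` and its fields are hypothetical, and the assumption that children start at branch index 2 at every level is inferred from the comment on `leveltoken`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class TreeWalkSketch {
    static final String ROOT = "//*[contains(@id,\"treeBox\")]/div";
    static final String LEVEL = "/table/tbody/tr[{branchIncrement}]/td[2]";
    static final String NODE = "/table/tbody/tr/td[4]/span";

    final Predicate<String> isPresent;            // stand-in for a browser lookup
    final List<String> found = new ArrayList<>(); // locators that matched
    int probes = 0;                               // total lookups issued

    TreeWalkSketch(Predicate<String> isPresent) {
        this.isPresent = isPresent;
    }

    // Depth-first walk: try branch 2, 3, 4, ... under the given prefix
    // and stop at the first sibling index that does not exist.
    void walk(String prefix) {
        for (int branch = 2; ; branch++) {
            String child = prefix + LEVEL.replace("{branchIncrement}", Integer.toString(branch));
            probes++;
            if (!isPresent.test(child + NODE)) {
                return; // the failing probe that ends this sibling run
            }
            found.add(child + NODE);
            walk(child); // descend before moving to the next sibling
        }
    }
}
```

Under these assumptions a tree with n nodes costs 2n + 1 lookups: one successful probe per node plus one failing probe for every sibling run. Every one of those lookups re-evaluates the full locator from the root, so any per-lookup overhead in the locator engine is multiplied accordingly.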


习惯成性 2024-11-08 09:55:29

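In Firefox, Selenium RC's XPath locators are handled by the browser's native XPath engine, whereas CSS locators are handled by a JavaScript library (Dean Edwards' cssQuery.js). Later Selenium versions (e.g., the 2.0b* series) use jQuery's Sizzle CSS library, but they still do the work in JavaScript. Apart from the implied speed difference, you are also doing a pattern match in your root expression (i.e., [id*=treeBox]), which requires enumerating the entire DOM to find matches before you even start descending the tree. Think about how you would write that in plain JavaScript and you'll begin to see the problem.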
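
To make the enumeration point concrete, here is a toy model in Java (chosen to match the question's code) rather than the JavaScript the library actually runs. It is only an illustration: `IdLookupSketch` and its methods are hypothetical, and it says nothing about how cssQuery.js or Sizzle are really implemented. It simply contrasts a substring match on `id`, which has to test every element, with an exact lookup through an index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IdLookupSketch {
    // Toy "document": just the id of each element, in document order.
    final List<String> ids = new ArrayList<>();
    // Toy id index, playing the role of an exact id lookup.
    final Map<String, Integer> idIndex = new HashMap<>();
    int comparisons = 0;

    void addElement(String id) {
        idIndex.put(id, ids.size());
        ids.add(id);
    }

    // Models a substring match like [id*=fragment]: every element is tested.
    List<Integer> findByIdContains(String fragment) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            comparisons++;
            if (ids.get(i).contains(fragment)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Models an exact id lookup: one index access, no scan.
    Integer findByExactId(String id) {
        return idIndex.get(id);
    }
}
```

In this model the substring search costs one comparison per element in the document on every call, regardless of where the match sits.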

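If it makes you feel any better, IE still has no native XPath implementation, so Selenium uses one of several JavaScript implementations in that browser, and as a result it runs at anywhere from one half to one tenth the speed of XPath in Firefox 3.6.

Long story short, there isn't much you can do to make CSS locators faster in this particular case.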


沫离伤花 2024-11-08 09:55:29

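Generally, this isn't something you can help. The XPath selector mechanism in Selenium makes use of the browser's own XPath facility. Even IE6 has one of these. I'm not aware of any browser that offers a CSS selector facility through JavaScript, so Selenium has to use its own code. Since that code is all JavaScript, and a browser's internal XPath resolution is typically done in native code, it is a lot slower (especially in IE6).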


狼性发作 2024-11-08 09:55:29

Thanks for that feedback. After reading your note, I wondered if I could get a substantial improvement by using a tiny bit of code to resolve a literal id value to replace the contains expression used repeatedly.

Here are four different locators I've used for the same thing. Two of the locators are XPath, and two are CSS. For each pair, one uses a contains expression, and one resolves to a literal first. In each case, the example locator is for the last node of a three-level, 1307-node tree.

XPath with contains:

//*[contains(@id,"treeBox")]/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[26]/td[2]/table/tbody/tr/td[4]/span

XPath where literal replaces contains expression:

id('ns_7_5R4GAB1A0GKQ50IQJQR7VV10M6__treeBox')/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[24]/td[2]/table/tbody/tr/td[4]/span

CSS with contains:

css=[id*=treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(24)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span

CSS where literal replaces contains expression:

css=[id=ns_7_5R4GAB1A0GKQ50IQJQR7VV10M6__treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(24)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span
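
The "tiny bit of code" is not shown in the post, so the following is only a guess at its shape. It assumes the id is read once through an attribute locator of the form `locator@attribute` (the convention used by Selenium RC's `getAttribute`), and the `Function<String, String>` is a stand-in for that call so the sketch runs without a browser. The class and method names are hypothetical.

```java
import java.util.function.Function;

class LiteralRootSketch {
    // Attribute locator that asks for the id of the element matched by the
    // contains expression; evaluated once instead of on every lookup.
    static final String ID_PROBE = "//*[contains(@id,\"treeBox\")]@id";

    // Stand-in for a call such as selenium.getAttribute(ID_PROBE).
    static String resolveId(Function<String, String> getAttribute) {
        return getAttribute.apply(ID_PROBE);
    }

    // Root expressions in the same shape as the "literal" locators in the post.
    static String xpathRoot(String id) {
        return "id('" + id + "')/div";
    }

    static String cssRoot(String id) {
        return "css=[id=" + id + "]>div";
    }

    public static void main(String[] args) {
        // Fake attribute reader that returns the id value quoted in the post.
        String id = resolveId(locator -> "ns_7_5R4GAB1A0GKQ50IQJQR7VV10M6__treeBox");
        System.out.println(xpathRoot(id));
        System.out.println(cssRoot(id));
    }
}
```

The resolved roots then replace `rootbase` for the remainder of the walk, so the pattern match is paid for once rather than on every lookup.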

Working with two different sized trees, one 102 nodes, the other 1307 nodes, I found the following.

102 nodes:
      | contains | literal |
XPath | 15 sec.  | 13 sec. |
CSS   | 19 sec.  | 19 sec. |

1307 nodes:
      | contains  | literal   |
XPath | 255 sec.  | 145 sec.  |
CSS   | 1893 sec. | 1811 sec. |

Clearly, a native implementation (XPath on Firefox with Se-RC) is much faster than a JavaScript implementation. The trade-off is that it might not work as well across browsers.
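
The totals above are easier to compare as per-node costs and CSS-to-XPath ratios. The sketch below only does that arithmetic on the numbers from the two tables; the class and method names are hypothetical.

```java
class TimingSketch {
    // Average seconds spent per tree node for one full walk.
    static double perNode(double totalSeconds, int nodes) {
        return totalSeconds / nodes;
    }

    // How many times slower the CSS run was than the XPath run.
    static double slowdown(double cssSeconds, double xpathSeconds) {
        return cssSeconds / xpathSeconds;
    }

    public static void main(String[] args) {
        // Figures taken from the tables in the post (contains column).
        System.out.printf("102 nodes:  XPath %.2f s/node, CSS %.2f s/node%n",
                perNode(15, 102), perNode(19, 102));
        System.out.printf("1307 nodes: XPath %.2f s/node, CSS %.2f s/node%n",
                perNode(255, 1307), perNode(1893, 1307));
        // Ratios for the contains and literal columns on the large tree.
        System.out.printf("1307 nodes: CSS/XPath %.1fx (contains), %.1fx (literal)%n",
                slowdown(1893, 255), slowdown(1811, 145));
    }
}
```

On these figures the contains-based XPath cost goes from about 0.15 to about 0.20 seconds per node between the two trees, while the CSS cost goes from about 0.19 to about 1.45 seconds per node. The CSS runs on the large tree are roughly 7 times (contains) and 12 times (literal) slower than XPath. The two trees may differ in shape as well as size, so this is a rough comparison rather than a controlled one.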
