Watir is slow on nested tables

Posted on 2024-11-15 17:31:36

I am using watir-webdriver to scrape a page with a nested, table-based layout. As an example, I built a very small toy site at http://veryslow.staticloud.com/. To find the innermost table, which contains the elements USSR and Brazil, I use the following code:

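require "rubygems"
require "watir-webdriver"

br = Watir::Browser.new
br.goto("http://veryslow.staticloud.com/")
# The /m flag lets "." match newlines, so the pattern can span several rows
reg = /USSR.+Brazil/m
# Each chained call descends one level into the nested tables
mytable = br.table(:text, reg).table(:text, reg).table(:text, reg).table(:text, reg).table(:text, reg).table(:text, reg)
# The table is only located when it is first used, i.e. here
mytable.text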

I have two questions:

  1. Is there a better way to search for these inner tables?
  2. Why is it so slow? Actually locating the table (which happens when I call mytable.text) takes a substantial amount of time. For complex websites with nested table-based layouts, this is painfully long.

I know nested table design is a bad idea, but if you have to read from such tables, is there a faster way to do it?

Comments (5)

世俗缘 2024-11-22 17:31:36

Whenever you're using a Regexp to locate elements, we need to do the filtering on the Ruby side as opposed to in the browser itself. That means that for each time you call .table(:text, reg) here, we find all the tables inside the containing element, and filter through that in Ruby to find one that matches the Regexp. That's going to be slow, especially with a page like this.

月下客 2024-11-22 17:31:36

So far I have found that XPath is a better approach when the page structure is known. Something like

mytable = br.table(:xpath,"/html/body/table/tbody/tr[3]/td/table/tbody/tr[3]/td/table[2]/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td[3]/table")

is usually much faster.
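
If the absolute path turns out to be too brittle, a relative XPath might be worth trying. The sketch below rests on an assumption that is not confirmed by the toy site: that the target is a table containing both words with no further table nested inside it. `innermost_table_xpath` is a hypothetical helper that only builds the XPath string; the Watir call is left in a comment because it needs a live browser session.

```ruby
# Hypothetical helper: builds an XPath for a table that contains every given
# word and has no table nested inside it (i.e. an innermost table).
# Assumes the words contain no single quotes.
def innermost_table_xpath(*words)
  conditions = words.map { |word| "contains(., '#{word}')" }
  "//table[#{(['not(.//table)'] + conditions).join(' and ')}]"
end

xpath = innermost_table_xpath("USSR", "Brazil")
puts xpath

# With a running browser session (not executed here):
#   mytable = br.table(:xpath, xpath)
#   mytable.text
```

Because the whole condition lives in one XPath expression, the match is done by a single query rather than by a Regexp filter at each level, which is the same reason the absolute path above tends to be faster.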

∞琼窗梦回ˉ 2024-11-22 17:31:36

Is there any chance of having the developers assign a name or class to the tables, rows, or cells, based at least on their position or on their function in that location? That would make things far more testable, I should think. You could then do something like look for a cell with the class 'originating_city' and the text "New York". As it is, you have a testing minefield, and if you can't get any developer cooperation to make the thing testable, I'd seriously start updating your resume and looking for a new position before they go down in flames.

In your specific example, you might try making use of .parent, since there is only one cell in the entire table with USSR in it. But that would work poorly for any other city name, such as Brazil.

Then again, I doubt your current regular-expression-driven approach would work with any other city combination possible on that page where some part of the combination is not unique.

平生欢 2024-11-22 17:31:36

If you are looking for text, you can read the entire text out of the top table and parse it for what you're looking for. This is how I cut a lot of overhead from table searches, until I realised I couldn't identify empty table cells; now I have to do things the slow way. It may still help you if you're not interested in the position of the displayed text.

Otherwise... not really. Unless something distinguishes an inner table (or its parent/child) from the outer table, it's hard to identify it.
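
A rough sketch of the text-parsing idea, assuming the outer table's text has already been read in one call (for example with `br.table.text`). The sample string and the `lines_between` helper are made up for illustration; the real text from the toy site will look different.

```ruby
# Hypothetical sample of what a single br.table.text call might return.
page_text = <<~TEXT
  Header
  USSR
  Canada

  Brazil
  Footer
TEXT

# Returns the non-empty lines from the first one matching `first` through the
# next one matching `last`, or an empty array if either is missing.
def lines_between(text, first, last)
  lines = text.lines.map(&:strip).reject(&:empty?)
  start = lines.index { |line| line =~ first }
  return [] unless start
  finish = lines.drop(start).index { |line| line =~ last }
  return [] unless finish
  lines[start, finish + 1]
end

p lines_between(page_text, /USSR/, /Brazil/)
```

Blank lines are dropped here, which reflects the limitation mentioned above: once the table is flattened to text, empty cells can no longer be told apart.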

流年里的时光 2024-11-22 17:31:36

See if you can find any attributes that the table can be found by.

mytable = br.table(:xpath,"/html/body/table/tbody/tr[3]/td/table/tbody/tr[3]/td/table[2]/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td[3]/table")

This will work much better and be less brittle if you write it like

mytable = br.table(:xpath,"//table[@name='sometablename']")

Sometimes UI elements have dynamic ids that change on every page refresh; for instance, id='xyz12345' becomes id='abc475843' after a refresh. In this case, you can gain speed by parsing br.html with Nokogiri or Hpricot (Nokogiri is preferred over Hpricot, though).
