Ruby Mechanize screen scraping help

Posted on 2024-10-17 12:41:59

I am trying to scrape a single row from a table whose first column holds a date. I only want the third row, which is the one with today's date.

This is my Mechanize code. I am trying to select the row that has today's date, along with all of its columns:

agent.page.search("//td").map(&:text).map(&:strip)

Output:
"11-02-2011", "1", "1", "1", "1", "0", "0,00 DKK", "0,00", "0,00 DKK", 
"12-02-2011", "5", "5", "1", "4", "0", "0,00 DKK", "0,00", "0,00 DKK", 
"14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK",
"7", "9", "3", "6", "0", "0,00 DKK", "0,00", "0,00 DKK"

I only want to scrape the third row, which is the one with today's date.

Comments (1)

戈亓 2024-10-24 12:41:59

Rather than loop over every <td> tag using '//td', search for the <tr> tags, grab only the third one, then loop over the <td> tags inside that row.

Mechanize uses Nokogiri internally, so here's how to do it in Nokogiri-ese:

html = <<EOT
<table>
<tr><td>11-02-2011</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>12-02-2011</td><td>5</td><td>5</td><td>1</td><td>4</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>14-02-2011</td><td>1</td><td>3</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>,00</td><td>0,00 DKK</td></tr>
</table>
EOT

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML(html)

pp doc.search('//tr')[2].search('td').map{ |n| n.text }

>> ["14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK"]

Append .search('//tr')[2].search('td').map{ |n| n.text } to Mechanize's agent.page, like so:

agent.page.search('//tr')[2].search('td').map{ |n| n.text }

It's been a while since I played with Mechanize, so it might also be agent.page.parser....


EDIT:

"More rows will be added to the table. The row that I want to scrape is always the second to last."

It's important to put that information into your original question. The more accurate your question, the more accurate our answers.
