Ruby Mechanize screen scraping help

Posted 2024-10-17 12:41:59, 486 words, 2 views, 0 comments

I am trying to scrape a row in a table that contains a date. I want to scrape only the third row, which has today's date.

This is my Mechanize code. I am trying to select the row that has today's date, along with its columns:

agent.page.search("//td").map(&:text).map(&:strip)

Output:
"11-02-2011", "1", "1", "1", "1", "0", "0,00 DKK", "0,00", "0,00 DKK", 
"12-02-2011", "5", "5", "1", "4", "0", "0,00 DKK", "0,00", "0,00 DKK", 
"14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK",
"7", "9", "3", "6", "0", "0,00 DKK", "0,00", "0,00 DKK"

I want to scrape only the third row, the one with today's date.

戈亓 2024-10-24 12:41:59

Rather than loop over the <td> tags using '//td', search for the <tr> tags, grab only the third one, then loop over '//td'.

Mechanize uses Nokogiri internally, so here's how to do it in Nokogiri-ese:

html = <<EOT
<table>
<tr><td>11-02-2011</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>12-02-2011</td><td>5</td><td>5</td><td>1</td><td>4</td><td>0</td><td>0,00 DKK</td><td>0,00</td><td>0,00 DKK</td></tr>
<tr><td>14-02-2011</td><td>1</td><td>3</td><td>1</td><td>1</td><td>0</td><td>0,00 DKK</td><td>,00</td><td>0,00 DKK</td></tr>
</table>
EOT

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML(html)

pp doc.search('//tr')[2].search('td').map{ |n| n.text }

>> ["14-02-2011", "1", "3", "1", "1", "0", "0,00 DKK", ",00", "0,00 DKK"]

Append .search('//tr')[2].search('td').map{ |n| n.text } to Mechanize's agent.page, like so:

agent.page.search('//tr')[2].search('td').map{ |n| n.text }

It's been a while since I played with Mechanize, so it might also be agent.page.parser....

EDIT:

There will be more rows in the table. The row that I want to scrape is always the second-to-last.

It's important to put that information into your original question. The more accurate your question, the more accurate our answers.
