Ruby Mechanize table scraping does not capture the whole row

Posted on 2024-10-17 22:10:37

I am trying to scrape a table from a website with Mechanize.
I want to scrape the second row.

When I run:

agent.page.search('table.ea').search('tr')[-2].search('td').map{ |n| n.text }

I expected it to return the whole row, but it only returns: ["2011-02-17", "0,00"]

Why does it return only the first and last columns instead of every column in the row?

XPath:
/html/body/center/table/tbody/tr[2]/td[2]/table/tbody/tr[3]/td/table/tbody/tr[2]/td/table/tbody/tr[2]

CSS path:
html body center table tbody tr td table tbody tr td table tbody tr td table.ea tbody tr td.total

The page is similar to this:

<table><table><table>
<table width="100%" border="0" cellpadding="0" cellspacing="1" class="ea">
<tr>
    <th><a href="#">Date</a></th>
    <th><a href="#">One</a></th>    
    <th><a href="#">Two</a></th>    
    <th><a href="#">Three</a></th>     
    <th><a href="#">Four</a></th>    
    <th><a href="#">Five</a></th>        
    <th><a href="#">Six</a></th>        
    <th><a href="#">Seven</a></th>      
    <th><a href="#">Eight</a></th>
</tr>
<tr>
    <td><a href="#">2011-02-17</a></td>
    <td align="right">0</td>    
    <td align="right">0</td>    
    <td align="right">0,00</td>     
    <td align="right">0</td>    
    <td align="right">0</td>        
    <td align="right">0</td>    
    <td align="right">0</td>        
    <td align="right">387</td>      
    <td align="right">0,00</td>     <!-- FOV -->
    <td align="right">0,00</td>
</tr>
<tr>
    <td class="total">Ialt</td>
    <td class="total" align="right">0</td>  
    <td class="total" align="right">40</td>     
    <td class="total" align="right">0,46</td>   
    <td class="total" align="right">2</td>      
    <td class="total" align="right">0</td>        
    <td class="total" align="right">0</td>      
    <td class="total" align="right">0</td>        
    <td class="total" align="right">3.060</td>      
    <td class="total" align="right">0,00</td>       
    <td class="total" align="right">18,58</td>
</tr>
</table>
</table></table></table>

黎歌 2024-10-24 22:10:37

Using the following Ruby code (https://gist.github.com/835603):

require 'mechanize'
require 'pp'

a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

a.get('http://binarymuse.net/table.html') do |page|
  pp page.search('table.ea').search('tr')[-2].search('td').map{ |n| n.text }
end

I get the following output:

["2011-02-17", "0", "0", "0,00", "0", "0", "0", "0", "387", "0,00", "0,00"]

蓝眸 2024-10-24 22:10:37

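I suggest you save Mechanize for tasks harder than scraping a single page.
You can do this more simply with Nokogiri than with Mechanize (though of course Mechanize can do it too), because you only need to query the page (http://nokogiri.org/tutorials/searching_a_xml_html_document.html).

Give it a try!

Here is a link to an answer about Nokogiri.

Personally, I have used Mechanize when I needed to submit forms and similar things, although it has many other uses!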