使用 Nokogiri 和 Mechanize 解析 html 表
使用以下代码,我尝试从我们的电话提供商的 Web 应用程序中抓取呼叫日志,以将信息输入到我的 Ruby on Rails 应用程序中。
desc "Import incoming calls"
task :fetch_incomingcalls => :environment do
# Logs into manage.phoneprovider.co.uk and retrieved list of incoming calls.
require 'rubygems'
require 'mechanize'
require 'logger'
# Create a new mechanize object
agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
# Load the Phone Provider website
page = agent.get("https://manage.phoneprovider.co.uk/login")
# Select the first form
form = agent.page.forms.first
form.username = 'username
form.password = 'password
# Submit the form
page = form.submit form.buttons.first
# Click on link called Call Logs
page = agent.page.link_with(:text => "Call Logs").click
# Click on link called Incoming Calls
page = agent.page.link_with(:text => "Incoming Calls").click
# Prints out table rows
# puts doc.css('table > tr')
# Print out the body as a test
# puts page.body
end
正如您从最后五行中看到的,我已经测试了“puts page.body”成功运行并且上面的代码有效。它成功登录,然后导航到“呼叫日志”,然后导航到“来电”。来电表如下所示:
| Timestamp | Source | Destination | Duration |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
由以下代码生成:
<thead>
<tr>
<td>Timestamp</td>
<td>Source</td>
<td>Destination</td>
<td>Duration</td>
<td>Cost</td>
<td class='centre'>Recording</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<tr>
<td>03 Jan 13:40</td>
<td>12345678</td>
<td>12345679</td>
<td>00:01:14</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>30 Dec 20:31</td>
<td>12345678</td>
<td>12345679</td>
<td>00:02:52</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>24 Dec 00:03</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:09</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>23 Dec 14:56</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:07</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>21 Dec 13:26</td>
<td>07793770851</td>
<td>12345679</td>
<td>00:00:26</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
我正在尝试弄清楚如何仅选择我想要的单元格(时间戳、源) 、目的地和持续时间)并输出它们。然后我可以担心将它们输出到数据库而不是终端中。
我尝试过使用选择器小工具,但如果我选择多个,它只会显示“td”或“tr:nth-child(6)td,tr:nth-child(2)td”。
任何帮助或指示将不胜感激!
Using the following code I am trying to scrape a call log from our phone provider's web application to enter the info into my Ruby on Rails application.
desc "Import incoming calls"
task :fetch_incomingcalls => :environment do
# Logs into manage.phoneprovider.co.uk and retrieved list of incoming calls.
require 'rubygems'
require 'mechanize'
require 'logger'
# Create a new mechanize object
agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
# Load the Phone Provider website
page = agent.get("https://manage.phoneprovider.co.uk/login")
# Select the first form
form = agent.page.forms.first
form.username = 'username
form.password = 'password
# Submit the form
page = form.submit form.buttons.first
# Click on link called Call Logs
page = agent.page.link_with(:text => "Call Logs").click
# Click on link called Incoming Calls
page = agent.page.link_with(:text => "Incoming Calls").click
# Prints out table rows
# puts doc.css('table > tr')
# Print out the body as a test
# puts page.body
end
As you can see from the last five lines, I have tested that the 'puts page.body' works successfully and the above code works. It successfully logs in and then navigates to Call Logs followed by Incoming Calls.The incoming call table looks like this:
| Timestamp | Source | Destination | Duration |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
| 03 Jan 13:40 | 12345678 | 12345679 | 00:01:01 |
Which is generated from the following code:
<thead>
<tr>
<td>Timestamp</td>
<td>Source</td>
<td>Destination</td>
<td>Duration</td>
<td>Cost</td>
<td class='centre'>Recording</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<tr>
<td>03 Jan 13:40</td>
<td>12345678</td>
<td>12345679</td>
<td>00:01:14</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>30 Dec 20:31</td>
<td>12345678</td>
<td>12345679</td>
<td>00:02:52</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>24 Dec 00:03</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:09</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>23 Dec 14:56</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:07</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>21 Dec 13:26</td>
<td>07793770851</td>
<td>12345679</td>
<td>00:00:26</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
I'm trying to work out how to selects just the cells I want (Timestamp, Source, Destination and Duration) and output those. I can then worry about outputting them to the database rather than in Terminal.
I have tried using Selector Gadget but it just show either 'td' or 'tr:nth-child(6) td , tr:nth-child(2) td' if I select multiple.
Any help or pointers would be appreciated!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
表中有一个使用 XPath 可以轻松利用的模式。包含所需信息的行的
标记缺少
class
属性。幸运的是,XPath 提供了一些简单的逻辑运算,包括not()
。这正好提供了我们需要的功能。一旦减少了要处理的行数,我们就可以使用 XPath 的
element[n]
选择器迭代行并提取必要列的文本。这里需要注意的一个重要事项是,XPath 从 1 开始计算元素,因此表行的第一列将为 td[1]。使用 Nokogiri(和规格)的示例代码:
There is a pattern in the table that is easy to leverage using XPath. The
<tr>
tag of rows with the required information lack theclass
attribute. Fortunately, XPath provides some simple logical operations includingnot()
. This provides just the functionality we need.Once we've reduced the number of rows we're dealing with, we can iterate over the rows and extract the text of the necessary columns by using XPath's
element[n]
selector. One important note here is that XPath counts elements starting from 1, so the first column of a table row would betd[1]
.Example code using Nokogiri (and specs):
您的答案就在这个railscasts
http://railscasts.com/episodes/190- screen-scraping-with-nokogiri
这也有帮助
如何使用 Nokogiri 解析 HTML 表格?
Your answer lies in this railscasts
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
This too can help
How do I parse an HTML table with Nokogiri?
您应该能够使用 XPath 选择器从根(最坏的情况)到达所需的确切节点。 此处列出了将 XPath 与 Nokogiri 结合使用。
有关如何使用 XPath 访问所有元素的详细信息,请查看此处。
You should be able to reach the exact node you required from the root (worst case) using XPath selectors. Using XPath with Nokogiri is listed here.
For detail on how reach all your elements using XPath, look here.