Parsing an HTML table with Nokogiri and Mechanize

Posted on 2024-12-25 09:44:28

Using the following code I am trying to scrape a call log from our phone provider's web application to enter the info into my Ruby on Rails application.

desc "Import incoming calls"
task :fetch_incomingcalls => :environment do

    # Log into manage.phoneprovider.co.uk and retrieve the list of incoming calls.
    require 'rubygems'
    require 'mechanize'
    require 'logger'

    # Create a new mechanize object
    agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }

    # Load the Phone Provider website
    page = agent.get("https://manage.phoneprovider.co.uk/login")

    # Select the first form
    form = agent.page.forms.first
    form.username = 'username'
    form.password = 'password'

    # Submit the form
    page = form.submit form.buttons.first

    # Click on link called Call Logs
    page = agent.page.link_with(:text => "Call Logs").click

    # Click on link called Incoming Calls
    page = agent.page.link_with(:text => "Incoming Calls").click

    # Prints out table rows
    # puts doc.css('table > tr')

    # Print out the body as a test
    # puts page.body

end

As you can see from the last five lines, I have tested 'puts page.body' and it runs successfully, so the code above works. It logs in and then navigates to Call Logs, followed by Incoming Calls. The incoming call table looks like this:

| Timestamp    |    Source    |    Destination    |    Duration    |
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |    
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |    
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |    
| 03 Jan 13:40 |    12345678  |    12345679       |    00:01:01    |

It is generated from the following markup:

<thead>
<tr>
<td>Timestamp</td>
<td>Source</td>
<td>Destination</td>
<td>Duration</td>
<td>Cost</td>
<td class='centre'>Recording</td>
</tr>
</thead>
<tbody>
<tr class='o'>
<tr>
<td>03 Jan 13:40</td>
<td>12345678</td>
<td>12345679</td>
<td>00:01:14</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>30 Dec 20:31</td>
<td>12345678</td>
<td>12345679</td>
<td>00:02:52</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>24 Dec 00:03</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:09</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='e'>
<tr>
<td>23 Dec 14:56</td>
<td>12345678</td>
<td>12345679</td>
<td>00:00:07</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>
<tr class='o'>
<tr>
<td>21 Dec 13:26</td>
<td>07793770851</td>
<td>12345679</td>
<td>00:00:26</td>
<td></td>
<td class='opt recording'>
</td>
</tr>
</tr>

I'm trying to work out how to select just the cells I want (Timestamp, Source, Destination and Duration) and output those. I can then worry about writing them to the database rather than to the terminal.

I have tried using Selector Gadget, but when I select multiple cells it just shows either 'td' or 'tr:nth-child(6) td , tr:nth-child(2) td'.

Any help or pointers would be appreciated!

揽清风入怀 2025-01-01 09:44:28

There is a pattern in the table that is easy to leverage using XPath. The <tr> tags of the rows that hold the required information lack a class attribute. Fortunately, XPath provides some simple logical operations, including not(), which gives us exactly the functionality we need.

Once we've reduced the number of rows we're dealing with, we can iterate over the rows and extract the text of the necessary columns by using XPath's element[n] selector. One important note here is that XPath counts elements starting from 1, so the first column of a table row would be td[1].

Example code using Nokogiri (and specs):

require "rspec"
require "nokogiri"

HTML = <<HTML
<table>
  <thead>
    <tr>
      <td>
        Timestamp
      </td>
      <td>
        Source
      </td>
      <td>
        Destination
      </td>
      <td>
        Duration
      </td>
      <td>
        Cost
      </td>
      <td class='centre'>
        Recording
      </td>
    </tr>
  </thead>
  <tbody>
    <tr class='o'>
      <td></td>
    </tr>
    <tr>
      <td>
        03 Jan 13:40
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:01:14
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='e'>
      <td></td>
    </tr>
    <tr>
      <td>
        30 Dec 20:31
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:02:52
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='o'>
      <td></td>
    </tr>
    <tr>
      <td>
        24 Dec 00:03
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:00:09
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='e'>
      <td></td>
    </tr>
    <tr>
      <td>
        23 Dec 14:56
      </td>
      <td>
        12345678
      </td>
      <td>
        12345679
      </td>
      <td>
        00:00:07
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
    <tr class='o'>
      <td></td>
    </tr>
    <tr>
      <td>
        21 Dec 13:26
      </td>
      <td>
        07793770851
      </td>
      <td>
        12345679
      </td>
      <td>
        00:00:26
      </td>
      <td></td>
      <td class='opt recording'></td>
    </tr>
  </tbody>
</table>
HTML

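class TableExtractor
  # Returns one hash per data row. Data rows are the <tr> elements
  # inside <tbody> that have no class attribute.
  def extract_data(html)
    Nokogiri::HTML(html).xpath("//table/tbody/tr[not(@class)]").collect do |row|
      # XPath positions are 1-based, so td[1] is the first column.
      timestamp   = row.at_xpath("td[1]").text.strip
      source      = row.at_xpath("td[2]").text.strip
      destination = row.at_xpath("td[3]").text.strip
      duration    = row.at_xpath("td[4]").text.strip
      {:timestamp => timestamp, :source => source, :destination => destination, :duration => duration}
    end
  end
end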

RSpec.describe TableExtractor do
  before(:all) do
    @html = HTML
  end

  it "should extract the timestamp properly" do
    expect(subject.extract_data(@html)[0][:timestamp]).to eq "03 Jan 13:40"
  end

  it "should extract the source properly" do
    expect(subject.extract_data(@html)[0][:source]).to eq "12345678"
  end

  it "should extract the destination properly" do
    expect(subject.extract_data(@html)[0][:destination]).to eq "12345679"
  end

  it "should extract the duration properly" do
    expect(subject.extract_data(@html)[0][:duration]).to eq "00:01:14"
  end

  it "should extract all informational rows" do
    expect(subject.extract_data(@html).count).to eq 5
  end
end
