How do I scrape values out of nested tables in HTML using Nokogiri and Ruby?

Posted 2024-11-06 18:58:14

I am trying to extract the name, ID, phone, email, gender, ethnicity, date of birth, class, major, school and GPA from a page I am parsing with Nokogiri.

I tried several different XPath expressions, but everything I try grabs much more than I want:

<span class="subTitle"><b>Recruit Profile</b></span>
<br><table border="0" width="100%"><tr>
<td>
      <table bgcolor="#afafaf" border="0" cellpadding="0" width="100%">
<tr>
<td>
      <table bgcolor="#cccccc" border="0" cellpadding="2" cellspacing="2" width="100%">
<tr>
<td bgcolor="#dddddd"><b>Name</b></td>
          <td bgcolor="#dddddd">Some Person</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>EDU ID</b></td>
          <td bgcolor="#dddddd">A12345678</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Phone</b></td>
          <td bgcolor="#dddddd">123-456-7890</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Address</b></td>
          <td bgcolor="#dddddd">1234 Somewhere Dr.<br>City ST, 12345</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Email</b></td>
          <td bgcolor="#dddddd">[email protected]</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Gender</b></td>
          <td bgcolor="#dddddd">Female</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Ethnicity</b></td>
          <td bgcolor="#dddddd">Unknown</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Date of Birth</b></td>
          <td bgcolor="#dddddd">Jan 1st, 1901</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Class</b></td>
          <td bgcolor="#dddddd">Sophomore</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Major</b></td>
          <td bgcolor="#dddddd">Biology</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>School</b></td>
          <td bgcolor="#dddddd">University of Somewhere</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>GPA</b></td>
          <td bgcolor="#dddddd">0.00</td>
        </tr>
<tr>
<td bgcolor="#dddddd" valign="top"><b>Availability</b></td>
          <td bgcolor="#dddddd">
      <table border="0" cellspacing="0" cellpadding="0">
<tr>

同尘 2024-11-13 18:58:14

I assume that there will be many "Recruit Profile" spans, each followed by a table that holds all the details. The following method takes your entire HTML page, finds just those spans, and for each of them finds the table that follows it and then looks up the fields you want anywhere below that table:

require 'nokogiri'

# Pass in or set the array of labels you want to use
# Returns an array of hashes mapping these labels to the values
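def recruits_details(html, fields = ['Name', 'EDU ID', 'Phone', 'Email', 'Gender'])
  doc = Nokogiri::HTML(html)
  # Each profile starts with a <span> whose <b> child reads "Recruit Profile"
  recruit_labels = doc.xpath('//span[b[text()="Recruit Profile"]]')
  recruit_labels.map do |recruit_label|
    # The details live in the first <table> that follows the span as a sibling
    recruit_table = recruit_label.at_xpath('following-sibling::table')
    Hash[ fields.map do |field_label|
      # Find the label cell, then read the first text node of the cell after it
      label_td = recruit_table.at_xpath(".//td[b[text()='#{field_label}']]")
      [field_label, label_td.at_xpath('following-sibling::td/text()').text ]
    end ]
  end
end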

require 'pp'
pp recruits_details(html_string)
#=> [{"Name"=>"Some Person",
#=>   "EDU ID"=>"A12345678",
#=>   "Phone"=>"123-456-7890",
#=>   "Email"=>"[email protected]",
#=>   "Gender"=>"Female"}]

An XPath expression like .//foo[bar[text()="jim"]] means:

  • Find a 'foo' element anywhere under the current node
  • ...but only if it has a 'bar' element as a child
  • ...but only if that 'bar' element has the text "jim" as its content

An XPath expression like following-sibling::... means "find any elements that are siblings after the current node and that match the expression ...".

The XPath expression .../text() selects the text node(s) directly inside an element; Nokogiri's text method is then used to extract the value (the actual string) of that text node.

Nokogiri's xpath method returns a NodeSet (an array-like collection) of all nodes matching the expression, while the at_xpath method returns only the first matching node, or nil if nothing matches.
