How do I scrape values from HTML with nested tables using Nokogiri and Ruby?
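I am trying to extract the name, ID, phone, email, gender, ethnicity, date of birth, class, major, school, and GPA from a page I am parsing with Nokogiri.
I tried several different XPath expressions, but everything I try grabs much more than I want: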
<span class="subTitle"><b>Recruit Profile</b></span>
<br><table border="0" width="100%"><tr>
<td>
<table bgcolor="#afafaf" border="0" cellpadding="0" width="100%">
<tr>
<td>
<table bgcolor="#cccccc" border="0" cellpadding="2" cellspacing="2" width="100%">
<tr>
<td bgcolor="#dddddd"><b>Name</b></td>
<td bgcolor="#dddddd">Some Person</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>EDU ID</b></td>
<td bgcolor="#dddddd">A12345678</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Phone</b></td>
<td bgcolor="#dddddd">123-456-7890</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Address</b></td>
<td bgcolor="#dddddd">1234 Somewhere Dr.<br>City ST, 12345</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Email</b></td>
<td bgcolor="#dddddd">[email protected]</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Gender</b></td>
<td bgcolor="#dddddd">Female</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Ethnicity</b></td>
<td bgcolor="#dddddd">Unknown</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Date of Birth</b></td>
<td bgcolor="#dddddd">Jan 1st, 1901</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Class</b></td>
<td bgcolor="#dddddd">Sophomore</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Major</b></td>
<td bgcolor="#dddddd">Biology</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>School</b></td>
<td bgcolor="#dddddd">University of Somewhere</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>GPA</b></td>
<td bgcolor="#dddddd">0.00</td>
</tr>
<tr>
<td bgcolor="#dddddd" valign="top"><b>Availability</b></td>
<td bgcolor="#dddddd">
<table border="0" cellspacing="0" cellpadding="0">
<tr>
Comments (1)
I assume that there will be many "Recruit Profile" spans, each followed by a table that wraps up all the details. The approach is to take your entire HTML page, find just those spans, and for each of them find the table that follows it and then find the fields you want anywhere below that table.
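An XPath expression like
.//foo[bar[text()="jim"]]
means: find every foo element at any depth below the current node, but only if it has a bar child element whose text is "jim".
An XPath expression like
following-sibling::...
means: find any elements that are siblings after the current node and that match the expression ...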
The XPath expression
.../text()
selects the Text node; the text method is then used to extract the value (the actual string) of that text node.
Nokogiri's xpath method returns a NodeSet (an array-like collection) of all nodes matching the expression, while the at_xpath method returns only the first matching node, or nil if nothing matches.