BeautifulSoup: getting a value from a table
I am trying to scrape
http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104
and get the "Owner Name(s)".
What I have works, but it is really ugly and surely not the best approach, so I am looking for a better way.
Here is what I have:
soup = BeautifulSoup(url_opener.open(url))
x = soup('table', text = re.compile("Owner Name"))
print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next
The relevant HTML is
<td valign="top">
<table border="1" cellpadding="1" cellspacing="0" align="right">
<tbody><tr class="tableheaders">
<td>Owner Name(s)</td>
</tr>
<tr>
<td>PILCHER DONALD L </td>
</tr>
</tbody></table>
</td>
Wow, there are lots of questions about BeautifulSoup. I looked through them but didn't find an answer that helped me; hopefully this is not a duplicate question.
Comments (3)
(Edit: apparently the HTML the OP posted is misleading: the live page has no tbody tag to look for, even though he made a point of including one in that HTML. So I changed the code to use table instead of tbody.)
As there may be several table rows you want (e.g., see the sibling URL to the one you give, with the last digit, 4, changed into a 5), I suggest a loop such as the following:
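The loop itself did not survive in this copy of the answer. What follows is only a minimal sketch of the approach described, assuming the bs4 package on Python 3 with the built-in html.parser (the question uses the older BeautifulSoup 3 API on Python 2). The function name owner_names and the HTML constant are made up for illustration; the markup is the fragment from the question, not the live page.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question; the live page may differ.
HTML = """
<td valign="top">
<table border="1" cellpadding="1" cellspacing="0" align="right">
<tbody><tr class="tableheaders">
<td>Owner Name(s)</td>
</tr>
<tr>
<td>PILCHER DONALD L </td>
</tr>
</tbody></table>
</td>
"""


def owner_names(html):
    """Return every non-blank string in the owner table except the header."""
    soup = BeautifulSoup(html, "html.parser")
    owner = re.compile("Owner Name")

    # Locate the header text, then climb its parents up to the <table> tag.
    header = soup.find(string=owner)
    if header is None:
        return []
    node = header.parent
    while node is not None and node.name != "table":
        node = node.parent
    if node is None:
        return []

    # Keep strings that are not empty/whitespace and are not the header.
    def wanted(s):
        return bool(s.strip()) and not owner.match(s)

    return [s.strip() for s in node.find_all(string=wanted)]


if __name__ == "__main__":
    for name in owner_names(HTML):
        print(name)
```

Because the climb stops at the first enclosing table, the sketch behaves the same whether or not a tbody element sits between the rows and the table.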
This is reasonably robust to minor changes in page structure: having located the cell of interest, it climbs through its parents until it finds the table tag, then iterates over all navigable strings within that table that aren't empty (or just whitespace), excluding the owner header.
This is Aaron DeVore's answer from the BeautifulSoup discussion group. It works well for me.
Tag.string is needed to get to the actual name string.
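The snippets from that answer are missing here. As a rough sketch of the idea (find the header text, step to the next td, read its .string), assuming bs4 on Python 3 rather than the BeautifulSoup 3 API used in the question; owner_name and HTML are illustrative names, and the markup is a trimmed version of the fragment in the question:

```python
from bs4 import BeautifulSoup

# Trimmed version of the table from the question.
HTML = """
<table border="1">
<tr class="tableheaders">
<td>Owner Name(s)</td>
</tr>
<tr>
<td>PILCHER DONALD L </td>
</tr>
</table>
"""


def owner_name(html):
    soup = BeautifulSoup(html, "html.parser")
    # Exact-match search on the header text; returns a NavigableString.
    label = soup.find(string="Owner Name(s)")
    # find_next("td") is the first <td> after the header in document order;
    # .string gives the text inside that tag.
    name = label.find_next("td").string
    return name.strip()


if __name__ == "__main__":
    print("And the owner is", owner_name(HTML))
```

Note that .string is None when a tag contains more than one child, so this relies on the name cell holding plain text only.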
If you're doing a bunch of them, you can even go for a list comprehension.
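The list comprehension from the answer is also missing in this copy. A possible version, again assuming bs4 on Python 3; the two-table markup and the second owner name are invented purely to show more than one match:

```python
from bs4 import BeautifulSoup

# Two owner tables; the second name is a made-up placeholder.
HTML = """
<table><tr class="tableheaders"><td>Owner Name(s)</td></tr>
<tr><td>PILCHER DONALD L </td></tr></table>
<table><tr class="tableheaders"><td>Owner Name(s)</td></tr>
<tr><td>EXAMPLE OWNER </td></tr></table>
"""


def all_owner_names(html):
    soup = BeautifulSoup(html, "html.parser")
    # One entry per header found: the text of the <td> that follows it.
    return [
        str(label.find_next("td").string).strip()
        for label in soup.find_all(string="Owner Name(s)")
    ]


if __name__ == "__main__":
    print(all_owner_names(HTML))
```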
This is a slight improvement, but I couldn't figure out how to get rid of the three parents.
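One way the chain of parents might be avoided, offered as a sketch rather than as this answer's original code: in bs4 (assumed here, on Python 3), find_parent("table") jumps straight to the enclosing table however many levels sit in between. The function name owner_via_find_parent is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Markup from the question, including the tbody wrapper.
HTML = """
<table border="1" cellpadding="1" cellspacing="0" align="right">
<tbody><tr class="tableheaders">
<td>Owner Name(s)</td>
</tr>
<tr>
<td>PILCHER DONALD L </td>
</tr>
</tbody></table>
"""


def owner_via_find_parent(html):
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find(string=re.compile("Owner Name"))
    # Replaces .parent.parent.parent: go directly to the enclosing <table>.
    table = header.find_parent("table")
    # The second row holds the owner; take the text of its first cell.
    rows = table.find_all("tr")
    return rows[1].td.get_text(strip=True)


if __name__ == "__main__":
    print("And the owner is", owner_via_find_parent(HTML))
```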