Beautifulsoup: getting a value in a table

Posted on 2024-08-13 16:33:38

I am trying to scrape
http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104
and get the "Owner Name(s)" value.
What I have works, but it is really ugly and I am sure it is not the best approach, so I am looking for a better way.
Here is what I have:

soup = BeautifulSoup(url_opener.open(url))            
x = soup('table', text = re.compile("Owner Name"))
print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

The relevant HTML is

<td valign="top">
    <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tbody><tr class="tableheaders">
    <td>Owner Name(s)</td>
    </tr>

    <tr>

    <td>PILCHER DONALD L                         </td>
    </tr>

    </tbody></table>
</td>

Wow, there are lots of questions about BeautifulSoup. I looked through them but didn't find an answer that helped me, so hopefully this is not a duplicate question.

Comments (3)

魄砕の薆 2024-08-20 16:33:38

(Edit: apparently the HTML the OP posted lies: there is in fact no tbody tag to look for, even though he made a point of including it in that HTML. So I changed the code to look for table instead of tbody.)

As there may be several table rows you want (e.g., see the sibling URL to the one you give, with the last digit, 4, changed to 5), I suggest a loop such as the following:

# locate the table containing a cell with the given text
owner = re.compile('Owner Name')
cell = soup.find(text=owner).parent
while cell.name != 'table': cell = cell.parent
# print all non-empty strings in the table (except for the given text)
for x in cell.findAll(text=lambda x: x.strip() and not owner.match(x)):
  print x

This is reasonably robust to minor changes in page structure: having located the cell of interest, it walks up through its parents until it finds the table tag, then iterates over all navigable strings within that table that are not empty (or just whitespace), excluding the owner header.
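For readers on current tooling, here is a minimal sketch of the same idea, assuming Python 3 and the bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 API used above. The function name owner_names and the HTML constant are hypothetical, and the second owner row is made up purely to show the multi-row case.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the HTML in the question (without tbody).
# The second owner row is invented for illustration only.
HTML = """
<table border="1" cellpadding="1" cellspacing="0" align="right">
<tr class="tableheaders">
<td>Owner Name(s)</td>
</tr>
<tr>
<td>PILCHER DONALD L                         </td>
</tr>
<tr>
<td>EXAMPLE SECOND OWNER   </td>
</tr>
</table>
"""


def owner_names(html):
    """Return every non-empty string in the owner table except the header."""
    soup = BeautifulSoup(html, "html.parser")
    header = re.compile(r"Owner Name")
    # Locate the text node that holds the header label.
    label = soup.find(string=header)
    if label is None:
        return []
    # Climb to the nearest enclosing table, whatever sits in between.
    table = label.find_parent("table")
    if table is None:
        return []
    # stripped_strings skips whitespace-only nodes and trims the rest.
    return [s for s in table.stripped_strings if not header.match(s)]


print(owner_names(HTML))
```

Because find_parent("table") climbs as far as it needs to, the result should be the same whether or not the parser or the page places a tbody between the rows and the table.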

陌生 2024-08-20 16:33:38

This is Aaron DeVore's answer from the BeautifulSoup discussion group. It works well for me.

soup = BeautifulSoup(...)
label = soup.find(text="Owner Name(s)")

Tag.string is needed to get the actual name string:

name = label.findNext('td').string

If you're doing a bunch of them, you can even go for a list comprehension.

names = [unicode(label.findNext('td').string) for label in
soup.findAll(text="Owner Name(s)")]
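A rough equivalent under Python 3 with the bs4 package (BeautifulSoup 4) might look like the sketch below. It assumes the header cell's text is exactly "Owner Name(s)", as in the question's HTML; the names first_owner and all_owners are hypothetical. It uses get_text(strip=True) instead of .string so the trailing padding in the cell is dropped.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the HTML in the question.
HTML = """
<table border="1" cellpadding="1" cellspacing="0" align="right">
<tr class="tableheaders">
<td>Owner Name(s)</td>
</tr>
<tr>
<td>PILCHER DONALD L                         </td>
</tr>
</table>
"""


def first_owner(html):
    """Return the text of the first td that follows the header label."""
    soup = BeautifulSoup(html, "html.parser")
    # Exact match on the header text node.
    label = soup.find(string="Owner Name(s)")
    if label is None:
        return None
    # find_next searches forward in document order from the label.
    cell = label.find_next("td")
    return cell.get_text(strip=True) if cell is not None else None


def all_owners(html):
    """List-comprehension variant: one name per matching header label."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        label.find_next("td").get_text(strip=True)
        for label in soup.find_all(string="Owner Name(s)")
        if label.find_next("td") is not None
    ]


print(first_owner(HTML))
print(all_owners(HTML))
```

Note that this picks up only the first cell after each header, so a table with several owner rows would need a loop over the table's rows instead.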

樱花细雨 2024-08-20 16:33:38

This is a slight improvement, but I couldn't figure out how to get rid of the three parents.

x[0].parent.parent.parent.findAll('td')[1].string
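One possible way to drop the chained .parent calls, sketched here under the assumption of Python 3 and the bs4 package (BeautifulSoup 4) instead of the API used in the question, is to ask for the enclosing table by name. The function name owner_from_table is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the HTML in the question.
HTML = """
<table border="1" cellpadding="1" cellspacing="0" align="right">
<tr class="tableheaders">
<td>Owner Name(s)</td>
</tr>
<tr>
<td>PILCHER DONALD L                         </td>
</tr>
</table>
"""


def owner_from_table(html):
    """Find the header text, jump to its table, and read the second cell."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(string=re.compile("Owner Name"))
    if label is None:
        return None
    # One call replaces .parent.parent.parent (td, then tr, then table).
    table = label.find_parent("table")
    cells = table.find_all("td") if table is not None else []
    # Index 0 is the header cell; index 1 is the first owner cell.
    return cells[1].get_text(strip=True) if len(cells) > 1 else None


print(owner_from_table(HTML))
```

The index [1] still depends on the header being the first cell in the table, so this is no more robust than the original in that respect.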