Extracting the content inside a tag with BeautifulSoup

Posted on 2024-11-07 12:07:19

I'd like to extract the content Hello world. Please note that there are multiple <table> elements and similar <td colspan="2"> cells on the page as well:

<table border="0" cellspacing="2" width="800">
  <tr>
    <td colspan="2"><b>Name: </b>Hello world</td>
  </tr>
  <tr>
...

I tried the following:

hello = soup.find(text='Name: ')
hello.findPreviousSiblings

But it returned nothing.

In addition, I'm having a problem extracting My home address from the following:

<td><b>Address:</b></td>

<td>My home address</td>

I'm using the same method to search for text="Address: ", but how do I navigate down to the next line and extract the content of the <td>?

Comments (4)

孤独岁月 2024-11-14 12:07:19

The .contents attribute works well for extracting text from <tag>text</tag>. Note that it returns a list of the tag's children, not a bare string.

<td>My home address</td> example:

from bs4 import BeautifulSoup

s = '<td>My home address</td>'
soup = BeautifulSoup(s, 'html.parser')
td = soup.find('td')  # <td>My home address</td>
td.contents           # ['My home address']

<td><b>Address:</b></td> example:

s = '<td><b>Address:</b></td>'
soup = BeautifulSoup(s, 'html.parser')
b = soup.find('td').find('b')  # <b>Address:</b>
b.contents                     # ['Address:']
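
For the cell from the question, which mixes a <b> label with plain text, .contents holds two children. The sketch below is one way to pick out the text part; it assumes bs4 with the built-in "html.parser", and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# The cell from the question: a <b> label followed by plain text.
html = '<td colspan="2"><b>Name: </b>Hello world</td>'
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

children = td.contents            # list with two children: the <b> tag and the text node
last_text = str(td.contents[-1])  # the trailing text node as a plain str
all_text = td.get_text()          # every piece of text in the cell, concatenated

print(children)
print(last_text)
print(all_text)
```

Indexing .contents relies on the position of the text node inside the cell, so it is only suitable when the markup is as regular as in the question.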

半城柳色半声笛 2024-11-14 12:07:19

Use .next instead:

>>> s = '<table border="0" cellspacing="2" width="800"><tr><td colspan="2"><b>Name: </b>Hello world</td></tr><tr>'
>>> soup = BeautifulSoup(s)
>>> hello = soup.find(text='Name: ')
>>> hello.next
u'Hello world'

.next and .previous let you move through the document elements in the order the parser processed them, while the sibling methods work with the parse tree. In bs4 these attributes are spelled .next_element and .previous_element.
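
For the second part of the question (the address sitting in the next <td>), a tree-based approach may be easier to follow. The sketch below assumes bs4 4.4 or later (for the string= argument) and the built-in "html.parser". The combined table is an assumption pieced together from the two fragments in the question. Note that the markup there contains "Address:" without a trailing space, so an exact search for "Address: " would not match.

```python
from bs4 import BeautifulSoup

# Hypothetical table combining the two fragments from the question.
html = """
<table>
  <tr>
    <td colspan="2"><b>Name: </b>Hello world</td>
  </tr>
  <tr>
    <td><b>Address:</b></td>
    <td>My home address</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Name: the value is the text node right after the <b> label.
name_label = soup.find("b", string="Name: ")
name = str(name_label.next_sibling)

# Address: climb from the <b> label to its <td>, then step to the next <td>.
address_label = soup.find("b", string="Address:")
address_td = address_label.find_parent("td").find_next_sibling("td")
address = address_td.get_text(strip=True)

print(name)
print(address)
```

find_next_sibling("td") skips the whitespace text nodes between the two cells, which a bare .next_sibling would return first.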

窗影残 2024-11-14 12:07:19

Use the code below to extract the text content from HTML tags with Python and BeautifulSoup:

from bs4 import BeautifulSoup

s = '<td>Example information</td>'      # your raw HTML
soup = BeautifulSoup(s, 'html.parser')  # parse the HTML with BeautifulSoup
td = soup.find('td')                    # tag of interest: <td>Example information</td>
td.text                                 # 'Example information' (text with the markup stripped)

南汐寒笙箫 2024-11-14 12:07:19

from bs4 import Tag

def get_tag_html(tag: Tag) -> str:
    # Return the inner HTML of a tag: child tags are serialized, text nodes are kept as-is.
    return ''.join(i.decode() if isinstance(i, Tag) else str(i) for i in tag.contents)
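
A short usage sketch for the helper above, applied to the cell from the question. It assumes bs4 with the built-in "html.parser". The comparison with Tag.decode_contents() is an addition for illustration, not part of the original answer; for this input the two produce the same string, though they can differ when text nodes contain characters that need escaping.

```python
from bs4 import BeautifulSoup, Tag


def get_tag_html(tag: Tag) -> str:
    """Join the serialized children of a tag, i.e. its inner HTML."""
    return "".join(
        child.decode() if isinstance(child, Tag) else str(child)
        for child in tag.contents
    )


soup = BeautifulSoup('<td colspan="2"><b>Name: </b>Hello world</td>', "html.parser")
td = soup.find("td")

inner = get_tag_html(td)        # inner HTML via the helper
builtin = td.decode_contents()  # bs4's own method for a tag's inner markup

print(inner)
print(builtin)
```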