BeautifulSoup and line breaks inside a table row?

Posted on 2024-12-28 03:50:14

Sample code:

from BeautifulSoup import BeautifulSoup, SoupStrainer

html='''<tr>
<td align="left">Foo<br />
Bar<br /></td>
</tr>'''

soup=BeautifulSoup(html)
rows=soup.findAll('tr')
print rows
print rows[0].text.encode("utf8")

I would like the output to be something like "Foo Bar"; an actual newline between the two lines would also be fine. Instead, the output I get is just "FooBar", with no whitespace between the two lines.

I'm very new to Python and BeautifulSoup. Can someone give me a hand?

爱殇璃 2025-01-04 03:50:14

You can go one level further using cell = rows[0].find('td'), then see its contents using cell.contents, then filter the elements you need, then join them by spaces.
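A rough sketch of that first suggestion follows. It assumes the newer bs4 package (BeautifulSoup 4) under Python 3 rather than the BeautifulSoup 3 import used in the question; the html.parser backend and the variable names are choices made here for illustration only.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<tr>
<td align="left">Foo<br />
Bar<br /></td>
</tr>'''

# Assumption: bs4 with the stdlib "html.parser" backend.
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all('tr')

# Go one level down to the cell and look at its children.
cell = rows[0].find('td')

# Keep only the text nodes (this skips the <br/> tags), strip each one,
# and drop any that end up empty.
parts = [str(node).strip() for node in cell.contents
         if isinstance(node, NavigableString)]
parts = [p for p in parts if p]

joined = ' '.join(parts)
print(joined)

# bs4 can also do the strip-and-join in one call via get_text().
shortcut = rows[0].get_text(' ', strip=True)
print(shortcut)
```

Both variables should hold the same space-separated text; get_text() with a separator is usually the shorter route when no per-node filtering is needed.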

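Another option: you can use a regular expression to replace each <br /> with a space. Note that the substitution has to run on the row's markup (str(rows[0])), not on rows[0].text, which no longer contains any tags:

import re
s = re.sub(r'<br\s*/?>', ' ', str(rows[0]))

Then you can collapse runs of consecutive whitespace into a single space:

s = re.sub(r'\s+', ' ', s)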

Then the string should look like this:

>>> print s
<tr> <td align="left">Foo Bar </td> </tr>

Then you can easily extract the part you need.
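As a self-contained sketch of the regular-expression route, the snippet below works directly on the raw HTML string with only the standard library (Python 3). The helper name row_text is made up for this example, and the tag-stripping pattern is deliberately crude: it is fine for simple markup like the sample, but it is not a general HTML parser.

```python
import re

html = '''<tr>
<td align="left">Foo<br />
Bar<br /></td>
</tr>'''

def row_text(markup):
    """Hypothetical helper: flatten simple markup to space-separated text."""
    # Turn <br>, <br/> and <br /> into spaces so adjacent lines stay apart.
    s = re.sub(r'<br\s*/?>', ' ', markup)
    # Drop every remaining tag (crude; assumes no '>' inside attribute values).
    s = re.sub(r'<[^>]+>', '', s)
    # Collapse whitespace runs into single spaces and trim both ends.
    return re.sub(r'\s+', ' ', s).strip()

result = row_text(html)
print(result)
```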

妞丶爷亲个 2025-01-04 03:50:14

You might want to consider using lxml instead of BeautifulSoup. lxml enables you to search for elements using XPath which (I think) is easier than using BeautifulSoup's API.

import lxml.html as LH

html='''<tr>
<td align="left">Foo<br />
Bar<br /></td>
</tr>'''

doc = LH.fromstring(html)
for tr in doc.xpath('//tr'):
    print(repr(tr.text_content()))

yields

'Foo\nBar\n'

and

for text in doc.xpath('//tr/*/text()'):
    print(repr(text))

yields

'Foo'
'\nBar'
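To get from those separate text nodes to a single "Foo Bar" style string, one possible follow-up is to strip each node and join the non-empty ones. This is only a sketch: it assumes lxml is installed, and the row_texts name and the './/text()' path are choices made for this example.

```python
import lxml.html as LH

html = '''<tr>
<td align="left">Foo<br />
Bar<br /></td>
</tr>'''

doc = LH.fromstring(html)

row_texts = []
for tr in doc.xpath('//tr'):
    # Collect every text node under this row, at any depth.
    pieces = tr.xpath('.//text()')
    # Strip surrounding whitespace and skip nodes that were whitespace only.
    cleaned = [p.strip() for p in pieces if p.strip()]
    row_texts.append(' '.join(cleaned))

print(row_texts)
```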

About the author

坦然微笑
