How to remove whitespace in BeautifulSoup

Posted on 2024-10-03 21:14:20

I have a bunch of HTML I'm parsing with BeautifulSoup and it's been going pretty well except for one minor snag. I want to save the output as a single-line string. This is my current output:

    <li><span class="plaincharacterwrap break">
                    Zazzafooky but one two three!
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky2
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky3
                </span></li>

Ideally I'd like

<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li>

There's a lot of redundant whitespace that I'd like to get rid of but it's not necessarily removable using strip(), nor can I blatantly remove all the spaces because I need to retain the text. How can I do it? It seems like a common enough problem that regex would be overkill, but is that the only way?

I don't have any <pre> tags so I can be a little more forceful there.

Thanks once again!

南城旧梦 2024-10-10 21:14:20

Here is how you can do it without regular expressions:

>>> html = """    <li><span class="plaincharacterwrap break">
...                     Zazzafooky but one two three!
...                 </span></li>
... <li><span class="plaincharacterwrap break">
...                     Zazzafooky2
...                 </span></li>
... <li><span class="plaincharacterwrap break">
...                     Zazzafooky3
...                 </span></li>
... """
>>> html = "".join(line.strip() for line in html.split("\n"))
>>> html
'<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li><li><span class="plaincharacterwrap break">Zazzafooky3</span></li>'
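
The same idea can be wrapped in a small function. This is only a sketch: the name join_stripped_lines is made up for illustration, and the sample is the first two items from the question. Note that joining with an empty string also glues together text that was wrapped across lines, so it suits markup like the sample, where line breaks fall only next to tags.

```python
def join_stripped_lines(markup: str) -> str:
    """Strip every line and concatenate the pieces with no separator."""
    return "".join(line.strip() for line in markup.splitlines())


# First two list items from the question, written with explicit "\n"
# so the indentation is unambiguous.
sample = (
    '    <li><span class="plaincharacterwrap break">\n'
    '                    Zazzafooky but one two three!\n'
    '                </span></li>\n'
    '<li><span class="plaincharacterwrap break">\n'
    '                    Zazzafooky2\n'
    '                </span></li>\n'
)

result = join_stripped_lines(sample)
print(result)

# Caveat: words separated only by a line break end up joined.
wrapped = join_stripped_lines("one\n    two")
print(wrapped)  # onetwo
```

If the text itself may wrap across lines, joining with a single space instead of an empty string keeps the words apart, at the cost of leaving a space between adjacent tags.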

无边思念无边月 2024-10-10 21:14:20

Old question, I know, but beautifulsoup4 has this helper called stripped_strings.

Try this:

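# `about` is a tag from an already-parsed BeautifulSoup document
description_el = about.find('p', {"class": "description"})
# stripped_strings yields each text node with surrounding whitespace removed,
# skipping nodes that contain only whitespace
descriptions = list(description_el.stripped_strings)
description = "\n\n".join(descriptions) if descriptions else ""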
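
Applied to the markup from the question, stripped_strings returns only the text, not the tags. To keep the markup as well, one possible approach (my own sketch, not part of the answer above) is to rewrite every text node in place and then serialize the tree. It assumes bs4 is installed and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

sample = (
    '    <li><span class="plaincharacterwrap break">\n'
    '                    Zazzafooky but one two three!\n'
    '                </span></li>\n'
    '<li><span class="plaincharacterwrap break">\n'
    '                    Zazzafooky2\n'
    '                </span></li>\n'
)

soup = BeautifulSoup(sample, "html.parser")

# Text only: whitespace-only nodes are skipped, the rest are stripped.
texts = list(soup.stripped_strings)
print(texts)

# Markup with whitespace removed: strip each text node in place and
# drop the nodes that contain nothing but whitespace.
for node in soup.find_all(string=True):
    stripped = node.strip()
    if stripped:
        node.replace_with(stripped)
    else:
        node.extract()

one_line = str(soup)
print(one_line)
```

Stripping every text node is safe here only because the question says there are no <pre> tags; it also removes spaces between inline elements and neighboring text, which can change how the page renders.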

看春风乍起 2024-10-10 21:14:20

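re.sub(r'[ \n]{2,}', '', yourstring)

The regex [ \n]{2,} matches any run of two or more spaces and newlines, in any mix (the space does not need to be escaped). A more thorough implementation is this:

yourstring = re.sub(r' {2,}', '', yourstring)
yourstring = re.sub(r'\n+', '', yourstring)

I would have expected the first to replace only repeated newlines, but because the character class covers both spaces and newlines, it also removes the indentation that follows a line break. It does leave a lone newline in place, such as the one between </li> and <li> in the question's sample; the two-step version removes those as well.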
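
A quick check against the question's markup shows the difference. The second pattern below, \s*\n\s*, is a variation I am substituting for the two-step version, not something from the answer: it removes each line break together with the whitespace around it, so single spaces inside the text are never touched.

```python
import re

sample = (
    '    <li><span class="plaincharacterwrap break">\n'
    '                    Zazzafooky but one two three!\n'
    '                </span></li>\n'
    '<li><span class="plaincharacterwrap break">\n'
    '                    Zazzafooky2\n'
    '                </span></li>\n'
)

# Runs of two or more spaces/newlines are removed, but the single
# newline between </li> and <li> is only one character long and stays.
first_pass = re.sub(r"[ \n]{2,}", "", sample)
print(repr(first_pass))

# Variation: drop every line break plus adjacent whitespace, then trim
# the leading indentation of the first line.
collapsed = re.sub(r"\s*\n\s*", "", sample).strip()
print(collapsed)
```

Both patterns assume there is no whitespace-sensitive content such as <pre> blocks, which matches what the question states.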

默嘫て 2024-10-10 21:14:20

In case you came here after being troubled by BeautifulSoup's prettify(): I think this solution won't add extra whitespace.

from lxml import html, etree

# Parse the input file, then serialize it back without pretty-printing
with open('inputfile.html') as f:
    doc = html.fromstring(f.read())
with open('out.html', 'wb') as out:
    out.write(etree.tostring(doc))  # tostring() returns bytes

See Ian Bicking's answer on Stack Overflow.

Parsing the result via xml.etree is simple, provided out.html is well-formed XML:

from xml.etree import ElementTree as ET
tree = ET.parse('out.html')
title = tree.find(".//title").text
print(title)
