How to remove whitespace in BeautifulSoup

Posted 2024-10-03 21:14:20

I have a bunch of HTML I'm parsing with BeautifulSoup, and it's been going pretty well except for one minor snag. I want to save the output as a single-line string, but this is my current output:

    <li><span class="plaincharacterwrap break">
                    Zazzafooky but one two three!
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky2
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky3
                </span></li>

Ideally I'd like

<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li>

There's a lot of redundant whitespace that I'd like to get rid of but it's not necessarily removable using strip(), nor can I blatantly remove all the spaces because I need to retain the text. How can I do it? It seems like a common enough problem that regex would be overkill, but is that the only way?

I don't have any <pre> tags so I can be a little more forceful there.

Thanks once again!

Comments (4)

南城旧梦 2024-10-10 21:14:20

Here is how you can do it without regular expressions:

>>> html = """    <li><span class="plaincharacterwrap break">
...                     Zazzafooky but one two three!
...                 </span></li>
... <li><span class="plaincharacterwrap break">
...                     Zazzafooky2
...                 </span></li>
... <li><span class="plaincharacterwrap break">
...                     Zazzafooky3
...                 </span></li>
... """
>>> html = "".join(line.strip() for line in html.split("\n"))
>>> html
'<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li><li><span class="plaincharacterwrap break">Zazzafooky3</span></li>'
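
The same idea can be wrapped in a small helper. This is only a sketch: the name `collapse_lines` is made up for illustration, and it assumes, as in the question, that there are no <pre> tags and that no text run is wrapped across several lines (wrapped words would be glued together with no space between them).

```python
def collapse_lines(markup: str) -> str:
    """Strip leading/trailing whitespace from every line and join the lines."""
    # splitlines() also copes with "\r\n" line endings.
    return "".join(line.strip() for line in markup.splitlines())


sample = """    <li><span class="plaincharacterwrap break">
                    Zazzafooky but one two three!
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky2
                </span></li>
"""

result = collapse_lines(sample)
print(result)
```

Because the lines are joined with an empty string, spaces inside a single line of text are kept as they are; only the indentation and the line breaks go away.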
无边思念无边月 2024-10-10 21:14:20

Old question, I know, but beautifulsoup4 has this helper called stripped_strings.

Try this:

description_el = about.find('p', { "class": "description" })
descriptions = list(description_el.stripped_strings)
description = "\n\n".join(descriptions) if descriptions else ""
看春风乍起 2024-10-10 21:14:20
re.sub(r'[\ \n]{2,}', '', yourstring)

The regex [\ \n]{2,} matches any run of two or more spaces and newlines (escaping the space is optional here). A more thorough implementation does it in two passes:

yourstring = re.sub(r' {2,}', '', yourstring)
yourstring = re.sub(r'\n+', '', yourstring)

I would think the first would only replace multiple newlines, but it seems (at least for me) to work just fine, presumably because the character class also matches mixed runs such as a newline followed by indentation.
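
A slightly different regex sketch, not from the answer above, is to remove whitespace only where it touches a tag boundary, so that runs of spaces inside the text itself are left alone. The helper name `squeeze_around_tags` is hypothetical, and the approach assumes there are no <pre> tags and that whitespace between adjacent inline tags is not significant for your output.

```python
import re


def squeeze_around_tags(markup: str) -> str:
    """Drop whitespace that directly follows '>' or directly precedes '<'."""
    markup = re.sub(r">\s+", ">", markup)  # whitespace right after a tag
    markup = re.sub(r"\s+<", "<", markup)  # whitespace right before a tag
    return markup


sample = (
    '    <li><span class="plaincharacterwrap break">\n'
    '                    Zazzafooky but  one two three!\n'
    '                </span></li>\n'
    '<li><span class="plaincharacterwrap break">\n'
    '                    Zazzafooky2\n'
    '                </span></li>\n'
)

result = squeeze_around_tags(sample)
print(result)
```

Note that the double space in "but  one" survives, which the `{2,}` patterns would have removed.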

默嘫て 2024-10-10 21:14:20

In case you came here after being troubled by BeautifulSoup's prettify(): I think this solution won't add extra whitespace.

from lxml import html, etree

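# Read the input as bytes so lxml can detect the encoding itself.
with open('inputfile.html', 'rb') as f:
    doc = html.fromstring(f.read())
# etree.tostring() returns bytes, hence the binary output mode.
with open('out.html', 'wb') as out:
    out.write(etree.tostring(doc))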

Please see Ian Bicking's answer on Stack Overflow.

Parsing via xml.etree is simple, as long as out.html is well-formed XML:

from xml.etree import ElementTree as ET
tree = ET.parse('out.html')
title = tree.find(".//title").text
print(title)