How to remove whitespace in BeautifulSoup
I have a bunch of HTML I'm parsing with BeautifulSoup and it's been going pretty well except for one minor snag. I want to save the output into a single-lined string, with the following as my current output:
<li><span class="plaincharacterwrap break">
Zazzafooky but one two three!
</span></li>
<li><span class="plaincharacterwrap break">
Zazzafooky2
</span></li>
<li><span class="plaincharacterwrap break">
Zazzafooky3
</span></li>
Ideally I'd like
<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li>
There's a lot of redundant whitespace that I'd like to get rid of, but it's not necessarily removable using strip(), nor can I blatantly remove all the spaces because I need to retain the text. How can I do it? It seems like a common enough problem that regex would be overkill, but is that the only way?
I don't have any <pre> tags so I can be a little more forceful there.
Thanks once again!
Comments (4)
Here is how you can do it without regular expressions:
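A minimal sketch of one regex-free approach, assuming no line's leading or trailing whitespace is significant (the question says there are no <pre> tags). The helper name collapse_lines and the indentation in the sample string are made up for this example:

```python
# Sample markup from the question; the indentation is added for illustration.
html = (
    '<li><span class="plaincharacterwrap break">\n'
    '        Zazzafooky but one two three!\n'
    '    </span></li>\n'
    '<li><span class="plaincharacterwrap break">\n'
    '        Zazzafooky2\n'
    '    </span></li>\n'
)


def collapse_lines(markup: str) -> str:
    """Strip every line, then join the pieces with nothing in between."""
    return "".join(line.strip() for line in markup.splitlines())


single_line = collapse_lines(html)
print(single_line)
```

Spaces inside a line are left alone, so "Zazzafooky but one two three!" keeps its words apart. Text that wraps across several lines would be glued together without a space, so this only suits markup shaped like the sample.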
Old question, I know, but beautifulsoup4 has this helper called stripped_strings.
Try this:
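A short sketch of what stripped_strings gives you, assuming beautifulsoup4 is installed and using the built-in "html.parser" backend. The sample string is taken from the question, with indentation added for illustration:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = (
    '<li><span class="plaincharacterwrap break">\n'
    '        Zazzafooky but one two three!\n'
    '    </span></li>\n'
    '<li><span class="plaincharacterwrap break">\n'
    '        Zazzafooky2\n'
    '    </span></li>\n'
    '<li><span class="plaincharacterwrap break">\n'
    '        Zazzafooky3\n'
    '    </span></li>\n'
)

soup = BeautifulSoup(html, "html.parser")

# stripped_strings is a generator: it yields each text node with leading and
# trailing whitespace removed and skips nodes that are whitespace only.
texts = list(soup.stripped_strings)
print(texts)
```

Note that this yields only the text, not the surrounding tags, so it fits when you want the strings themselves rather than whitespace-free markup.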
The regex
[\ \n]{2}
matches any two consecutive characters that are each a space or a newline (the backslash before the space is harmless but not required in Python's re module); write {2,} to match a whole run of two or more. The more thorough implementation is this:
I would think the first would only replace multiple newlines, but it seems (at least for me) to work just fine.
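As a rough sketch of the regex approach: the pattern below is an assumption of this example, not necessarily the one this answer used. It removes each newline together with the whitespace on either side of it, and leaves single spaces inside the text untouched:

```python
import re

# Sample markup from the question; the indentation is added for illustration.
html = (
    '<li><span class="plaincharacterwrap break">\n'
    '        Zazzafooky but one two three!\n'
    '    </span></li>\n'
    '<li><span class="plaincharacterwrap break">\n'
    '        Zazzafooky2\n'
    '    </span></li>\n'
)

# \s*\n\s* matches a newline plus any whitespace around it (including
# neighbouring newlines), so indentation and line breaks disappear together.
single_line = re.sub(r"\s*\n\s*", "", html).strip()
print(single_line)

# For comparison: a "two or more" pattern leaves the lone newline that sits
# between </li> and the next <li>, because that run is only one character.
partial = re.sub(r"[ \n]{2,}", "", html)
```

The comparison at the end shows why a run-length pattern alone may not be enough for this sample: single newlines between adjacent tags survive it.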
If you came here after being troubled by BeautifulSoup's prettify(), I think this solution won't add extra spaces.
Please see Ian Bicking's answer on Stack Overflow.
Parsing via xml.etree is simple...
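A hedged sketch of what an xml.etree version might look like, assuming the fragment is well-formed XML (real-world HTML often is not, in which case this parser raises an error). The <ul> wrapper is added only so the fragment has a single root; it is not part of the question's markup:

```python
import xml.etree.ElementTree as ET

# Sample markup from the question; the indentation is added for illustration.
fragment = (
    '<li><span class="plaincharacterwrap break">\n'
    '        Zazzafooky but one two three!\n'
    '    </span></li>\n'
    '<li><span class="plaincharacterwrap break">\n'
    '        Zazzafooky2\n'
    '    </span></li>\n'
)

# Hypothetical wrapper element so the parser sees exactly one root.
root = ET.fromstring("<ul>" + fragment + "</ul>")

# Trim the whitespace stored before an element's first child (text) and
# after its closing tag (tail).
for element in root.iter():
    if element.text is not None:
        element.text = element.text.strip()
    if element.tail is not None:
        element.tail = element.tail.strip()

# Serialize the children only, leaving the wrapper out of the result.
single_line = "".join(ET.tostring(li, encoding="unicode") for li in root)
print(single_line)
```

Stripping text and tail this aggressively also removes spaces next to inline elements, so it is only safe for markup like the sample, where each text node stands alone inside its tag.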