Python: preserving newlines in lxml.html after cssselect and text_content()
In Python, how do I preserve paragraphs (i.e. keep newlines) when using lxml.html?
For example, the following strips the <p></p> tags and joins the lines, which is not what I want:
body = doc.cssselect("div.body")[0]
content = body.text_content()
Here's what I've tried that doesn't work:
- lxml.html.clean.clean_html: won't preserve the newlines.
- content.replace(" "*3,"\n\n"): doesn't work consistently, because the combined text does not always have the same number of spaces.
Comments (1)
lxml's text_content() is doing what the docs say it should: it strips the HTML tags and leaves the text behind.
You can work around this by adding your own newlines to the tree before extracting the content.
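
One way to add those newlines, as a minimal sketch rather than a definitive solution: append the line breaks to the tail text of each <p> and <br> element, then call text_content() as before. The helper name text_with_newlines and the sample HTML are made up for illustration. The element is selected with xpath here so the snippet only needs lxml; doc.cssselect("div.body")[0] from the question should work the same way if the cssselect package is installed. Note that the helper modifies the parsed tree in place.

```python
import lxml.html


def text_with_newlines(element):
    """Return the element's text, keeping paragraph and line breaks.

    Hypothetical helper, not part of lxml. It mutates the tree by
    prepending newlines to the tail text of <p> and <br> elements.
    """
    # A blank line after every paragraph.
    for p in element.iter("p"):
        p.tail = "\n\n" + (p.tail or "")
    # A single newline after every explicit line break.
    for br in element.iter("br"):
        br.tail = "\n" + (br.tail or "")
    # text_content() still drops the tags, but the added newlines remain.
    return element.text_content().strip()


# Sample input, assumed for illustration only.
html = (
    '<html><body><div class="body">'
    "<p>First paragraph.</p><p>Second<br>line.</p>"
    "</div></body></html>"
)
doc = lxml.html.fromstring(html)
body = doc.xpath('//div[@class="body"]')[0]
content = text_with_newlines(body)
print(content)
```

If the tree is needed again afterwards, it may be safer to run the helper on a copy, for example one made with copy.deepcopy(body).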