How can I add consistent whitespace to existing HTML using Python?
I just started working on a website that is full of pages with all their HTML on a single line, which is a real pain to read and work with. I'm looking for a tool (preferably a Python library) that will take HTML input and return the same HTML unchanged, except for adding linebreaks and appropriate indentation. (All tags, markup, and content should be untouched.)
The library doesn't have to handle malformed HTML; I'm passing the HTML through html5lib first, so it will be getting well-formed HTML. However, as mentioned above, I would rather it didn't change any of the actual markup itself; I trust html5lib and would rather let it handle the correctness aspect.
First, does anyone know if this is possible with just html5lib? (Unfortunately, their documentation seems a bit sparse.) If not, what tool would you suggest? I've seen some people recommend HTML Tidy, but I'm not sure if it can be configured to only change whitespace. (Would it do anything except insert whitespace if it were passed well-formed HTML to start with?)
Comments (3)
Algorithm
Example html5lib parser with BeautifulSoup tree builder
Output:
I chose J.F. Sebastian's answer because I think it's the simplest and thus the best, but I'm adding another solution for anyone who doesn't want to install Beautiful Soup. (Also, the Beautiful Soup tree builder is going to be deprecated in html5lib 1.0.) This solution was thanks to Amarghosh's tip; I just fleshed it out a bit. Looking at html5lib, I realized that it will output a minidom object natively, which means I can use his suggestion of toprettyxml(). Here's what I came up with:
And an example:
Why am I iterating over the children of the tree, rather than just calling toprettyxml() on dom_tree directly? Some of the HTML I'm dealing with is actually HTML fragments, so it's missing the <head> and <body> tags. To handle this I used the parseFragment() method, which means I get a DocumentFragment in return (rather than a Document). Unfortunately, it doesn't have a writexml() method (which toprettyxml() calls), so I iterate over the child nodes, which do have the method.
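The child-iteration idea described above can be sketched with the standard library alone. This is not the answer's original code, which did not survive on this page. It is a minimal stand-in that assumes the input is well-formed XML and builds the DocumentFragment by hand with xml.dom.minidom instead of calling html5lib's parseFragment(). The function name pretty_print_fragment and the throwaway wrapper element are hypothetical.

```python
from xml.dom import minidom


def pretty_print_fragment(markup, indent="  "):
    """Pretty-print a well-formed markup fragment one top-level node at a time."""
    # Assumption: the fragment is well-formed XML. It is wrapped in a
    # throwaway root element only so that minidom can parse it; with
    # html5lib, parseFragment() would hand back the fragment directly.
    doc = minidom.parseString("<wrapper>" + markup + "</wrapper>")

    # Move the parsed nodes into a DocumentFragment to mirror the
    # situation described in the answer.
    fragment = doc.createDocumentFragment()
    for node in list(doc.documentElement.childNodes):
        fragment.appendChild(node)

    # The fragment itself has no writexml(), so serialize each child
    # node separately and join the results.
    return "".join(
        node.toprettyxml(indent=indent) for node in fragment.childNodes
    )


if __name__ == "__main__":
    print(pretty_print_fragment("<ul><li>one</li><li>two</li></ul><p>text</p>"))
```

One caveat with this approach: for elements that mix text and child elements, toprettyxml() puts the text on its own indented line, so it does add whitespace inside text content rather than only between tags.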
If the HTML is indeed well-formed XML, you can use a DOM parser.
The toprettyxml() method lets you specify the indent, the newline character, and the encoding of the output. You might also want to check out the writexml() method.
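As a rough illustration of those parameters, here is a small sketch using xml.dom.minidom from the standard library. The sample markup and variable names are made up for the example, and it assumes the input parses as XML.

```python
import io
from xml.dom import minidom

# Made-up sample input; it must be well-formed XML for minidom to accept it.
doc = minidom.parseString("<div><p>one</p><p>two</p></div>")

# indent and newl control the whitespace that gets inserted.
pretty_text = doc.documentElement.toprettyxml(indent="  ", newl="\n")

# Passing an encoding makes toprettyxml() return bytes; calling it on the
# Document (rather than on an element) also emits an XML declaration.
pretty_bytes = doc.toprettyxml(indent="  ", newl="\n", encoding="utf-8")

# writexml() writes the same indented form to any object with a write() method.
buffer = io.StringIO()
doc.documentElement.writexml(buffer, indent="", addindent="  ", newl="\n")

if __name__ == "__main__":
    print(pretty_text)
```

Calling toprettyxml() on the root element instead of the Document leaves out the XML declaration, which is usually what you want when the result is going back into an HTML page.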