How can I use Python to add consistent whitespace to existing HTML?

Posted on 2024-08-22 01:08:13 · 493 words · 9 views · 0 comments

I just started working on a website that is full of pages with all their HTML on a single line, which is a real pain to read and work with. I'm looking for a tool (preferably a Python library) that will take HTML input and return the same HTML unchanged, except for adding linebreaks and appropriate indentation. (All tags, markup, and content should be untouched.)

The library doesn't have to handle malformed HTML; I'm passing the HTML through html5lib first, so it will be getting well-formed HTML. However, as mentioned above, I would rather it didn't change any of the actual markup itself; I trust html5lib and would rather let it handle the correctness aspect.

First, does anyone know if this is possible with just html5lib? (Unfortunately, their documentation seems a bit sparse.) If not, what tool would you suggest? I've seen some people recommend HTML Tidy, but I'm not sure if it can be configured to only change whitespace. (Would it do anything except insert whitespace if it were passed well-formed HTML to start with?)

Comments (3)

情丝乱 2024-08-29 01:08:13

Algorithm

  1. Parse html into some representation
  2. Serialize the representation back to html

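Example: the html5lib parser with the BeautifulSoup tree builder

#!/usr/bin/env python
# Requires an older html5lib release that still ships the
# "beautifulsoup" tree builder.
from html5lib import HTMLParser, treebuilders

parser = HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

c = """<HTML><HEAD><TITLE>Title</TITLE></HEAD><BODY>...... </BODY></HTML>"""

# step 1: parse the HTML into a BeautifulSoup tree
soup = parser.parse(c)
# step 2: serialize the tree back to indented HTML
print(soup.prettify())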

Output:

<html>
 <head>
  <title>
   Title
  </title>
 </head>
 <body>
  ......
 </body>
</html>
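
The two steps above (parse, then serialize) can also be sketched with only the Python 3 standard library. This is not the html5lib approach from the answer, just a rough illustration: `Indenter`, `indent_html`, and `VOID_TAGS` are hypothetical names, the input is assumed to be well-formed already, and the added line breaks may change how inline content renders.

```python
import html
from html.parser import HTMLParser

# Elements that have no closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Indenter(HTMLParser):
    """Hypothetical helper: re-emit each tag and text run on its own line."""

    def __init__(self, indent="  "):
        super().__init__(convert_charrefs=True)
        self.indent = indent
        self.depth = 0
        self.lines = []
        self.raw_text = False  # True while inside <script> or <style>

    def emit(self, text):
        self.lines.append(self.indent * self.depth + text)

    def handle_starttag(self, tag, attrs):
        self.emit(self.get_starttag_text())  # original tag text, attributes intact
        self.raw_text = tag in ("script", "style")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())  # self-closing form such as <br/>

    def handle_endtag(self, tag):
        self.raw_text = False
        if tag in VOID_TAGS:
            return  # ignore a stray closing tag for a void element
        self.depth = max(self.depth - 1, 0)
        self.emit("</%s>" % tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            # The parser decoded character references, so re-escape them,
            # except inside <script>/<style>, which are passed through raw.
            self.emit(text if self.raw_text else html.escape(text, quote=False))

    def handle_comment(self, data):
        self.emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.emit("<!%s>" % decl)


def indent_html(markup, indent="  "):
    """Step 1: parse the markup. Step 2: join the re-emitted lines."""
    parser = Indenter(indent)
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.lines)


print(indent_html("<html><head><title>Title</title></head><body>...... </body></html>"))
```

Unlike html5lib, this sketch does no error recovery, and it lowercases closing tags while leaving opening tags as written, so it only suits markup that has already been normalized.
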
尽揽少女心 2024-08-29 01:08:13

I chose J.F. Sebastian's answer because I think it's the simplest and thus the best, but I'm adding another solution for anyone who doesn't want to install Beautiful Soup. (Also, the Beautiful Soup tree builder is going to be deprecated in html5lib 1.0.) This solution was thanks to Amarghosh's tip; I just fleshed it out a bit. Looking at html5lib, I realized that it will output a minidom object natively, which means I can use his suggestion of toprettyxml(). Here's what I came up with:

from io import StringIO

from html5lib import HTMLParser, treebuilders

def tidy_html(text):
  """Returns a well-formatted version of input HTML."""

  # the "dom" tree builder produces xml.dom.minidom nodes
  p = HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
  dom_tree = p.parseFragment(text)

  # collect the pieces in a StringIO rather than concatenating strings
  pretty_HTML = StringIO()

  # pretty-print each top-level node of the fragment in turn
  node = dom_tree.firstChild
  while node:
    node_contents = node.toprettyxml(indent='  ')
    pretty_HTML.write(node_contents)
    node = node.nextSibling

  output = pretty_HTML.getvalue()
  pretty_HTML.close()
  return output

And an example (the exact line breaks around text nodes may vary with the Python version):

>>> text = """<b><i>bold, italic</b></i><div>a div</div>"""
>>> print(tidy_html(text))
<b>
  <i>
    bold, italic
  </i>
</b>
<div>
  a div
</div>

Why am I iterating over the children of the tree, rather than just calling toprettyxml() on dom_tree directly? Some of the HTML I'm dealing with is actually HTML fragments, so it's missing the <head> and <body> tags. To handle this I used the parseFragment() method, which means I get a DocumentFragment in return (rather than a Document). Unfortunately, it doesn't have a writexml() method (which toprettyxml() calls), so I iterate over the child nodes, which do have the method.
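
The child-iteration idea can be tried without html5lib. The following is a minimal Python 3 sketch, assuming a DocumentFragment built by hand with the standard library's minidom rather than one returned by parseFragment(); `pretty_children` is a hypothetical helper name.

```python
from io import StringIO
from xml.dom.minidom import parseString


def pretty_children(parent, indent="  "):
    """Pretty-print every child of `parent` and concatenate the results."""
    out = StringIO()
    node = parent.firstChild
    while node is not None:
        out.write(node.toprettyxml(indent=indent))
        node = node.nextSibling
    return out.getvalue()


# Build a fragment by moving the children of a throwaway root element into it.
doc = parseString("<root><b><i>bold, italic</i></b><div>a div</div></root>")
fragment = doc.createDocumentFragment()
for child in list(doc.documentElement.childNodes):
    fragment.appendChild(child)  # appendChild detaches the node from <root>

result = pretty_children(fragment)
print(result)
```

Walking firstChild and nextSibling keeps the top-level nodes in document order, and each node is serialized on its own, so the fragment itself never has to be serialized.
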

蹲墙角沉默 2024-08-29 01:08:13

If the HTML is indeed well-formed XML, you can use a DOM parser.

from xml.dom.minidom import parse, parseString

# if you have the HTML string in a variable
html = parseString(theHtmlString)

# or parse the HTML file
html = parse(htmlFileName)

print(html.toprettyxml())

The toprettyxml() method lets you specify the indent, the newline character, and the encoding of the output. You might also want to check out the writexml() method.
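
As a small illustration of those parameters, here is a Python 3 sketch using only the standard library. The sample markup and the variable names are made up for the example, and the input must be well-formed XML for parseString() to accept it.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<html><head><title>Title</title></head><body><p>Hello</p></body></html>"
)

# Without an encoding, toprettyxml() returns a str.
as_text = doc.toprettyxml(indent="  ", newl="\n")

# With an encoding, it returns bytes and names the encoding in the XML declaration.
as_bytes = doc.toprettyxml(indent="  ", newl="\n", encoding="utf-8")

# Calling it on the root element instead of the document skips the XML declaration,
# which is usually what you want for HTML.
body_only = doc.documentElement.toprettyxml(indent="  ")

print(body_only)
```
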
