BeautifulSoup innerHTML?

Posted on 2024-12-14 23:58:41

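Let's say I have a page with a div. I can easily get that div with soup.find().

Now that I have the result, I'd like to print the whole innerHTML of that div: that is, I need a single string containing all the HTML tags and text together, exactly like the string I'd get in JavaScript with obj.innerHTML. Is this possible?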

北渚 2024-12-21 23:58:41

TL;DR

With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, an equivalent of the DOM's innerHTML property might look something like this:

def innerHTML(element):
    """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
    return element.encode_contents()
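
To make the difference between the two methods concrete, here is a minimal sketch. The sample markup and variable names are made up for illustration, and it assumes Python 3 with the built-in html.parser backend:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
html = '<div id="outer"><p>Hello <b>world</b></p> tail text</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

inner_bytes = div.encode_contents()  # bytes, UTF-8 by default
inner_text = div.decode_contents()   # str (Unicode)

print(type(inner_bytes).__name__, type(inner_text).__name__)
print(inner_text)
```

On Python 3 the str version from decode_contents() is usually the more convenient one to print or search.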

These functions aren't currently in the online documentation, so I'll quote the current function definitions and the docstrings from the code.

`encode_contents` - since 4.0.4

def encode_contents(
    self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
    formatter="minimal"):
    """Renders the contents of this tag as a bytestring.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param encoding: The bytestring will be in this encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

See also the documentation on formatters; you'll most likely either use formatter="minimal" (the default) or formatter="html" (for html entities) unless you want to manually process the text in some way.
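
As a rough illustration of the two formatters mentioned above, the sketch below renders the same tag both ways. The sample text is invented, and the exact entity names produced by formatter="html" may vary between Beautiful Soup versions:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one escaped ampersand and two accented characters.
soup = BeautifulSoup("<p>caf\u00e9 &amp; cr\u00e8me</p>", "html.parser")
p = soup.p

# "minimal" escapes only what is needed for valid markup (&, <, >).
as_minimal = p.decode_contents(formatter="minimal")

# "html" additionally converts characters to named HTML entities where possible.
as_html = p.decode_contents(formatter="html")

print(as_minimal)
print(as_html)
```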

encode_contents returns an encoded bytestring. If you want a Python Unicode string then use decode_contents instead.
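
The encoding parameter from the signature above controls the bytes you get back. A small sketch, with a made-up one-word document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")

utf8_bytes = soup.p.encode_contents()                     # default encoding is UTF-8
latin1_bytes = soup.p.encode_contents(encoding="latin-1")  # same text, different bytes

print(utf8_bytes)
print(latin1_bytes)
```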

`decode_contents` - since 4.0.1

decode_contents does the same thing as encode_contents but returns a Python Unicode string instead of an encoded bytestring.

def decode_contents(self, indent_level=None,
                   eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                   formatter="minimal"):
    """Renders the contents of this tag as a Unicode string.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param eventual_encoding: The tag is destined to be
       encoded into this encoding. This method is _not_
       responsible for performing that encoding. This information
       is passed in so that it can be substituted in if the
       document contains a <META> tag that mentions the document's
       encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

BeautifulSoup 3

BeautifulSoup 3 doesn't have the above functions; instead, it has renderContents:

def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                   prettyPrint=False, indentLevel=0):
    """Renders the contents of this tag as a string in the given
    encoding. If encoding is None, returns a Unicode string.."""

This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.

猫弦 2024-12-21 23:58:41

Given a BS4 soup element like <div id="outer"><div id="inner">foobar</div></div>, here are several methods and attributes that retrieve its HTML and text in different ways, along with an example of what each returns.

InnerHTML:

inner_html = element.encode_contents()

b'<div id="inner">foobar</div>'  (a bytestring on Python 3; use element.decode_contents() for a str)

OuterHTML:

outer_html = str(element)

'<div id="outer"><div id="inner">foobar</div></div>'

OuterHTML (prettified):

pretty_outer_html = element.prettify()

'''<div id="outer">
 <div id="inner">
  foobar
 </div>
</div>'''

Text only (using .text):

element_text = element.text

'foobar'

Text only (using .string):

element_string = element.string

'foobar'
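
The snippets above can be combined into one runnable sketch. It assumes Python 3 and the html.parser backend, and it uses decode_contents() rather than encode_contents() so that the inner HTML comes back as a str. The second document (soup2) is an invented example showing a case where .string gives nothing back:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="outer"><div id="inner">foobar</div></div>', "html.parser")
element = soup.find(id="outer")

inner_html = element.decode_contents()  # markup of the children only
outer_html = str(element)               # markup including the tag itself
element_text = element.text             # all text beneath the tag
element_string = element.string         # works here: a single chain of only-children

# .string is None as soon as a tag has more than one child.
soup2 = BeautifulSoup("<div>a<b>c</b></div>", "html.parser")
mixed_string = soup2.div.string

print(inner_html)
print(outer_html)
print(element_text, element_string, mixed_string)
```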

征棹 2024-12-21 23:58:41

One option is to use something like this:

innerhtml = "".join([str(x) for x in div_element.contents])
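
One caveat with this approach, as far as I can tell: calling str() on a text node returns the raw text, so special characters in text are not re-escaped the way decode_contents() escapes them. The sketch below uses an invented snippet to show the difference; treat it as an observation about current Beautiful Soup 4 behaviour on Python 3 rather than a documented guarantee:

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing an escaped ampersand in a text node.
soup = BeautifulSoup("<div>a &amp; b <i>c</i></div>", "html.parser")
div_element = soup.div

joined = "".join(str(x) for x in div_element.contents)
decoded = div_element.decode_contents()

print(joined)   # text nodes come out unescaped
print(decoded)  # text nodes are escaped as valid HTML
```

If the result will be fed back into an HTML parser or a page, decode_contents() is probably the safer choice.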

淡淡の花香 2024-12-21 23:58:41

str(element) gives you the outerHTML; you can then strip the outer tag from that string.

戴着白色围巾的女孩 2024-12-21 23:58:41

How about just unicode(x)? Seems to work for me.

Edit: This will give you the outer HTML, not the inner HTML.

蹲在坟头点根烟 2024-12-21 23:58:41

The easiest way is to use the children property.

inner_html = soup.find('body').children

It returns an iterator over the tag's direct children (not a list), so you can print the full markup with a simple for loop.

for html in inner_html:
    print(html)
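
If you need one string rather than printed pieces, the children can be joined. A minimal sketch with an invented document; note that str() on text nodes returns their raw, unescaped text:

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration.
soup = BeautifulSoup("<html><body><h1>Title</h1><p>Text</p></body></html>", "html.parser")
body = soup.find("body")

children_is_list = isinstance(body.children, list)  # .children is an iterator
inner_html = "".join(str(child) for child in body.children)

print(children_is_list)
print(inner_html)
```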

亣腦蒛氧 2024-12-21 23:58:41

If I do not misunderstand, you mean that for an example like this:

<div class="test">
    text in body
    <p>Hello World!</p>
</div>

the output should look like this:

text in body
    <p>Hello World!</p>

So here is your answer:

''.join(map(str,tag.contents))
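
Applied to the example above, this might look as follows. It is a sketch that assumes the html.parser backend, which keeps the whitespace between tags as text nodes:

```python
from bs4 import BeautifulSoup

html = '<div class="test">\n    text in body\n    <p>Hello World!</p>\n</div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", class_="test")

inner = "".join(map(str, tag.contents))
print(inner)

# Surrounding whitespace is preserved; strip it if you do not need it.
print(inner.strip())
```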

jJeQQOZ5 2024-12-21 23:58:41

For just text, Beautiful Soup 4 `get_text()`

If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
'\nI linked to example.com\n'
soup.i.get_text()
'example.com'

You can specify a string to be used to join the bits of text together:

soup.get_text("|")
'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

[text for text in soup.stripped_strings]
# ['I linked to', 'example.com']

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.
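
A short sketch of that behaviour, assuming Beautiful Soup 4.9.0 or later and the html.parser backend; the markup is invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing visible text with script and style content.
markup = "<div><script>var x = 1;</script><p>Visible</p><style>p {}</style></div>"
soup = BeautifulSoup(markup, "html.parser")

visible_text = soup.get_text()
print(visible_text)

# The script source is still part of the markup itself.
inner_html = soup.div.decode_contents()
print(inner_html)
```

So get_text() is the tool for human-readable text, while decode_contents() is the one that behaves like innerHTML.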

Refer here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
