BeautifulSoup innerHTML?

Posted 2024-12-14 23:58:41


Let's say I have a page with a div. I can easily get that div with soup.find().

Now that I have the result, I'd like to print the WHOLE innerHTML of that div: I mean, I'd need a string with ALL the html tags and text all together, exactly like the string I'd get in javascript with obj.innerHTML. Is this possible?


Comments (8)

北渚 2024-12-21 23:58:41


TL;DR

With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, an equivalent of the DOM's innerHTML method might look something like this:

def innerHTML(element):
    """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
    return element.encode_contents()
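As a quick sanity check, the helper above can be exercised like this (assuming bs4 is installed; the sample markup is invented for illustration):

```python
from bs4 import BeautifulSoup

def innerHTML(element):
    """Returns the inner HTML of an element as a UTF-8 encoded bytestring."""
    return element.encode_contents()

# Made-up sample markup, just to show what each call returns.
soup = BeautifulSoup('<div id="d"><p>Hello <b>world</b></p></div>', 'html.parser')
div = soup.find('div')

print(innerHTML(div))          # a bytestring: b'<p>Hello <b>world</b></p>'
print(div.decode_contents())   # a str: '<p>Hello <b>world</b></p>'
```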

These functions aren't currently in the online documentation so I'll quote the current function definitions and the doc string from the code.

encode_contents - since 4.0.4

def encode_contents(
    self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
    formatter="minimal"):
    """Renders the contents of this tag as a bytestring.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param encoding: The bytestring will be in this encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

See also the documentation on formatters; you'll most likely either use formatter="minimal" (the default) or formatter="html" (for html entities) unless you want to manually process the text in some way.
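To see the difference between the two formatters, here is a small sketch (the sample markup is invented):

```python
from bs4 import BeautifulSoup

# Invented sample: a non-ASCII character plus an entity in the source.
soup = BeautifulSoup('<p>caf\u00e9 &amp; cr\u00e8me</p>', 'html.parser')
p = soup.find('p')

# "minimal" (the default) only escapes &, < and >:
print(p.decode_contents())                   # 'café &amp; crème'
# "html" additionally converts characters to named HTML entities:
print(p.decode_contents(formatter='html'))   # 'caf&eacute; &amp; cr&egrave;me'
```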

encode_contents returns an encoded bytestring. If you want a Python Unicode string then use decode_contents instead.


decode_contents - since 4.0.1

decode_contents does the same thing as encode_contents but returns a Python Unicode string instead of an encoded bytestring.

def decode_contents(self, indent_level=None,
                   eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                   formatter="minimal"):
    """Renders the contents of this tag as a Unicode string.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param eventual_encoding: The tag is destined to be
       encoded into this encoding. This method is _not_
       responsible for performing that encoding. This information
       is passed in so that it can be substituted in if the
       document contains a <META> tag that mentions the document's
       encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

BeautifulSoup 3

BeautifulSoup 3 doesn't have the above functions; instead it has renderContents:

def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                   prettyPrint=False, indentLevel=0):
    """Renders the contents of this tag as a string in the given
    encoding. If encoding is None, returns a Unicode string.."""

This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.

猫弦 2024-12-21 23:58:41


Given a BS4 soup element like <div id="outer"><div id="inner">foobar</div></div>, here are various methods and attributes that can be used to retrieve its HTML and text in different ways, along with an example of what they'll return.


InnerHTML:

inner_html = element.encode_contents()

b'<div id="inner">foobar</div>'

(On Python 3 this is a bytestring; use element.decode_contents() if you want a str.)

OuterHTML:

outer_html = str(element)

'<div id="outer"><div id="inner">foobar</div></div>'

OuterHTML (prettified):

pretty_outer_html = element.prettify()

'''<div id="outer">
 <div id="inner">
  foobar
 </div>
</div>'''

Text only (using .text):

element_text = element.text

'foobar'

Text only (using .string):

element_string = element.string

'foobar'
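The whole list above can be reproduced in one runnable snippet, using the same invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="outer"><div id="inner">foobar</div></div>', 'html.parser')
element = soup.find(id='outer')

print(element.decode_contents())   # innerHTML as a str: '<div id="inner">foobar</div>'
print(element.encode_contents())   # innerHTML as bytes: b'<div id="inner">foobar</div>'
print(str(element))                # outerHTML: '<div id="outer"><div id="inner">foobar</div></div>'
print(element.text)                # 'foobar'
print(element.string)              # 'foobar'
```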
征棹 2024-12-21 23:58:41


One of the options could be to use something like this:

 innerhtml = "".join([str(x) for x in div_element.contents]) 
淡淡の花香 2024-12-21 23:58:41


str(element) gives you the outerHTML; you can then strip the outer tag from that string.

戴着白色围巾的女孩 2024-12-21 23:58:41


How about just unicode(x) (str(x) on Python 3)? Seems to work for me.

Edit: This will give you the outer HTML and not the inner.

蹲在坟头点根烟 2024-12-21 23:58:41


The easiest way is to use the children property.

inner_html = soup.find('body').children

It returns an iterator over the element's children (not a list), so you can get the full markup with a simple for loop.

for html in inner_html:
    print(html)
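A runnable sketch of this approach (the sample document is invented); joining the stringified children rebuilds the inner HTML:

```python
from bs4 import BeautifulSoup

# Invented sample document.
soup = BeautifulSoup('<body><h1>Title</h1><p>Body text</p></body>', 'html.parser')

inner_html = soup.find('body').children   # an iterator over the child nodes

# Stringify each child and join them to rebuild the inner HTML.
print("".join(str(child) for child in inner_html))
# '<h1>Title</h1><p>Body text</p>'
```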
亣腦蒛氧 2024-12-21 23:58:41


If I do not misunderstand, you mean that for an example like this:

<div class="test">
    text in body
    <p>Hello World!</p>
</div>

the output should look like this:

text in body
    <p>Hello World!</p>

So here is your answer:

''.join(map(str,tag.contents))
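Running that one-liner on the example above (note that the whitespace inside the div is preserved):

```python
from bs4 import BeautifulSoup

markup = '''<div class="test">
    text in body
    <p>Hello World!</p>
</div>'''
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.find('div', class_='test')

# Stringify every child node (text and tags alike) and concatenate.
inner = ''.join(map(str, tag.contents))
print(inner)
# prints (with the original indentation kept):
#     text in body
#     <p>Hello World!</p>
```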
jJeQQOZ5 2024-12-21 23:58:41


For just the text, use Beautiful Soup 4's get_text()

If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
'\nI linked to example.com\n'
soup.i.get_text()
'example.com' 

You can specify a string to be used to join the bits of text together:

soup.get_text("|")
'\nI linked to |example.com|\n' 

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
'I linked to|example.com' 

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

[text for text in soup.stripped_strings]
# ['I linked to', 'example.com'] 

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.

Refer here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
