How to prevent BeautifulSoup from converting escaped characters

Posted on 2025-02-08 11:18:30

As the title says, I want BeautifulSoup to keep the original characters in the HTML instead of replacing them. A simple example:

from bs4 import BeautifulSoup

soup1 = BeautifulSoup(
    "APOLLOE4: Early Alzheimer&#39;s disease study",
    "html.parser",
)
html1 = str(soup1)
print(html1)

Current output:

"APOLLOE4: Early Alzheimer's disease study"

Expected output:

"APOLLOE4: Early Alzheimer&#39;s disease study"

I found the section of the BeautifulSoup docs that explains output formatters (https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=escape#output-formatters), but I couldn't get it to work.

Comments (1)

忆梦 2025-02-15 11:18:30

According to the official documentation, when BeautifulSoup parses a document (or a string), it first converts it to Unicode, and in BS4 HTML entities are always converted to the corresponding Unicode characters (the convertEntities argument that worked in BeautifulSoup 3 is no longer recognized).

Therefore, once the soup has been made with soup1 = BeautifulSoup("APOLLOE4: Early Alzheimer&#39;s disease study", "html.parser"), the &#39; has already been converted to a plain apostrophe, and this conversion cannot be prevented.
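
To make that concrete, here is a small check (a sketch that reuses the string from the question; the variable names are only illustrative) showing that the entity is already gone from the parsed text:

```python
from bs4 import BeautifulSoup

source = "APOLLOE4: Early Alzheimer&#39;s disease study"
soup = BeautifulSoup(source, "html.parser")

# The parser decodes the character reference while building the tree,
# so the text node holds a plain apostrophe rather than the entity.
text = soup.get_text()

print("&#39;" in source)  # the raw input still contains the entity
print("&#39;" in text)    # the parsed text does not
print("'" in text)        # it contains the decoded apostrophe instead
```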

One possible workaround is to convert it back after making the soup:

html1 = str(soup1).replace("'", "&#39;")
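
Since the question mentions output formatters, the same idea can probably also be expressed as a custom formatter function passed to decode(). The sketch below is not part of the original answer. The helper name keep_apostrophe_entity is made up for illustration, and it assumes that writing every apostrophe as &#39; is acceptable, because the parser does not remember how each apostrophe was spelled in the source.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution


def keep_apostrophe_entity(text: str) -> str:
    """Hypothetical helper: escape &, < and > as usual, then write ' as &#39;."""
    escaped = EntitySubstitution.substitute_xml(text)
    # Replace after the standard escaping so the "&" of "&#39;" is not escaped again.
    return escaped.replace("'", "&#39;")


soup1 = BeautifulSoup(
    "APOLLOE4: Early Alzheimer&#39;s disease study",
    "html.parser",
)

# A callable formatter is applied to each string and attribute value on output.
html1 = soup1.decode(formatter=keep_apostrophe_entity)
print(html1)
```

Compared with calling replace() on the whole serialized document, a formatter only sees text and attribute values, so single quotes that BeautifulSoup itself emits as attribute delimiters should be left untouched.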