How to prevent BeautifulSoup from encoding escaped characters
As mentioned in the title, I want BeautifulSoup to keep the original escaped characters in the HTML instead of replacing them. A simple example:
from bs4 import BeautifulSoup

soup1 = BeautifulSoup(
    "APOLLOE4: Early Alzheimer&#39;s disease study",
    "html.parser",
)
html1 = str(soup1)
print(html1)
Current output:
"APOLLOE4: Early Alzheimer's disease study"
Expected output:
"APOLLOE4: Early Alzheimer&#39;s disease study"
I found the section of the BeautifulSoup docs that explains output formatters (https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=escape#output-formatters), but I can't get it to work.
According to the official documentation, when BeautifulSoup parses a document (or a string), it first converts it to Unicode, and in BS4 HTML entities are always converted to Unicode characters (the convertEntities argument, which worked in BeautifulSoup 3, no longer works). Therefore, after making the soup with
soup1 = BeautifulSoup("APOLLOE4: Early Alzheimer&#39;s disease study", "html.parser")
the &#39; has already been converted, and this conversion cannot be prevented.
I think a possible solution is to convert it back after making the soup, like:
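The snippet that followed is not preserved here. As a minimal sketch of that idea (not the answerer's exact code), the apostrophe can be re-escaped either by post-processing the serialized string or by passing a function as the output formatter. The helper name escape_apostrophes and the variable names are hypothetical, and the input is assumed to contain &#39; as in the question.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution

source = "APOLLOE4: Early Alzheimer&#39;s disease study"
soup1 = BeautifulSoup(source, "html.parser")

# Option 1: serialize as usual, then turn every apostrophe back into &#39;.
# This touches the whole output string, including any attribute quoting.
html_replaced = str(soup1).replace("'", "&#39;")


# Option 2: a function used as the output formatter. Beautiful Soup calls it
# for each string and attribute value when serializing.
def escape_apostrophes(text):
    # Keep the default minimal escaping of &, < and > first ...
    escaped = EntitySubstitution.substitute_xml(text)
    # ... then re-escape apostrophes as numeric character references.
    return escaped.replace("'", "&#39;")


html_formatted = soup1.decode(formatter=escape_apostrophes)

print(html_replaced)
print(html_formatted)
```

Note that both variants escape every apostrophe in the output, including ones that were written literally in the source, because the parsed tree no longer records which form the input used.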