Preventing lxml from touching the contents of <script> tags

I'm trying to write a Python script that modifies the contents of <script> tags in files I'm parsing. I'm using lxml.html (as opposed to BeautifulSoup, etc.) because of its speed. The script contents are wrapped in comment markers (<!-- and -->):

<script>
<!--
...
-->
</script>

The problem is that when I try something like scriptNode.text = '<!-- ...', lxml converts the angle brackets to their HTML entities (&lt; and &gt;) when I write the HTML back to the file. I tried escaping them in the string ('\< ...'), but that doesn't help.

Judging by most modern websites, those comment markers are no longer needed. I could remove them, but many of the scripts also contain some HTML, and if that gets converted to entities as well, that's a problem.

I'm surprised that lxml modifies this data at all; last I heard, HTML parsers are designed to avoid modifying or interpreting data within <script> tags.

Is there a setting/command I can use to prevent this from happening?

Thanks

Comments (2)

北凤男飞 2024-11-22 19:21:47

Put them in a CDATA section.
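One possible reading of this suggestion, sketched with lxml's etree.CDATA wrapper: assigning a CDATA object to the node's text makes the XML serializer emit a CDATA section instead of escaping the angle brackets. The sample markup and script body below are made up for illustration, and the tree is built with lxml.etree rather than lxml.html to keep the sketch small. Browsers that parse a page as text/html do not give CDATA markers inside <script> any special meaning, so this mainly helps when the output is consumed as XML or XHTML.

```python
from lxml import etree

# Hypothetical minimal document, for illustration only.
root = etree.fromstring("<html><head><script/></head></html>")
script_node = root.find(".//script")

# Wrapping the text in etree.CDATA tells the serializer to write a
# CDATA section rather than entity-escaping the content.
script_node.text = etree.CDATA('if (a < b) { alert("<b>hi</b>"); }')

output = etree.tostring(root).decode()
print(output)
```

The script body must not contain the sequence ]]>, since that would terminate the CDATA section early.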

枕头说它不想醒 2024-11-22 19:21:47

An alternative solution I just found that seems to work as well is using tostring() instead of write():

import lxml.html

# tostring() returns bytes by default, so open the file in binary mode (Python 3)
with open('file.html', 'wb') as main:
    main.write(lxml.html.tostring(htmlTree))

instead of

htmlTree.write('file.html', pretty_print=False)

Figured I'd post it here as well, even though I decided to go with CDATA since it seems to be a cleaner solution that will prevent problems in the future with other parsing scripts as well.
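A likely explanation for the difference, offered as a sketch rather than a definitive diagnosis: ElementTree.write() serializes as XML unless told otherwise, and XML serialization escapes < and > in text, whereas lxml.html.tostring() defaults to method="html", which leaves <script> text alone. If that is what is happening here, passing method="html" to write() should have the same effect. The sample page and script body below are hypothetical.

```python
import io

import lxml.html

# Hypothetical sample page, used only for illustration.
doc = lxml.html.fromstring(
    "<html><head><script></script></head><body><p>hi</p></body></html>"
)
script_node = doc.xpath("//script")[0]
script_node.text = "<!--\nif (a < b) { document.write('<b>x</b>'); }\n-->"

tree = doc.getroottree()

# Default (XML) serialization: text inside <script> is entity-escaped.
xml_buf = io.BytesIO()
tree.write(xml_buf, pretty_print=False)
as_xml = xml_buf.getvalue().decode()

# HTML serialization: text inside <script> is written as-is.
html_buf = io.BytesIO()
tree.write(html_buf, method="html")
as_html = html_buf.getvalue().decode()

# lxml.html.tostring() uses method="html" by default.
as_html_string = lxml.html.tostring(doc).decode()

print(as_xml)
print(as_html)
```

In a real script the BytesIO buffers would be replaced by a file path, for example tree.write('file.html', method='html').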
