Prevent lxml from touching data in <script> tags

I'm trying to write a Python script that modifies the contents of the <script> tags in files I'm parsing. I'm using lxml.html (as opposed to BeautifulSoup, etc.) because of its speed. The contents of each script tag are wrapped in comment tags (<!-- -->):

<script>
<!--
...
-->
</script>

The problem is that when I try something like scriptNode.text = '<!-- ... -->', lxml converts the angle brackets to their HTML entities (&lt; and &gt;) when I write the HTML back to the file. I tried escaping them in the string ('\< ...'), but that doesn't help.

Looking at most modern websites, those comment tags don't seem to be needed. I can remove them, but many of the scripts also contain HTML, and if that gets converted to entities as well, that's a problem.

I'm surprised that lxml is modifying this data at all; last I heard, HTML parsers are designed to avoid modifying or interpreting data within <script> tags.

Is there a setting/command I can use to prevent this from happening?

Thanks

北凤男飞 2024-11-22 19:21:47

Put them in a CDATA section.
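
A minimal sketch of what the CDATA suggestion could look like in lxml, assuming the tree is serialized as XML (the default for write() and etree.tostring()). The sample markup and the script body are made up for illustration; etree.CDATA is lxml's wrapper for marking text as a CDATA section.

```python
from lxml import etree
import lxml.html

# Hypothetical input document, not taken from the original post.
doc = lxml.html.fromstring(
    "<html><head><script></script></head><body></body></html>"
)
script = doc.find(".//script")

# Wrapping the text in etree.CDATA asks the XML serializer to emit it
# verbatim inside <![CDATA[ ... ]]> instead of escaping < and >.
script.text = etree.CDATA("\n<!--\nif (a < b) { run(); }\n-->\n")

# etree.tostring() uses XML serialization by default, like write().
cdata_out = etree.tostring(doc)
print(cdata_out.decode())
```

Whether the result suits a given page still depends on how the consumer reads the file, so it is worth checking the output in the target browser or parser.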

枕头说它不想醒 2024-11-22 19:21:47

An alternative solution I just found that seems to work as well is using tostring() instead of write():

import lxml.html

# tostring() returns bytes, so open the file in binary mode
with open('file.html', 'wb') as main:
    main.write(lxml.html.tostring(htmlTree))

instead of

htmlTree.write('file.html', pretty_print=False)

Figured I'd post it here as well, even though I decided to go with CDATA, since it seems to be a cleaner solution that will also prevent problems with other parsing scripts in the future.
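
One plausible reason for the difference, offered as a sketch rather than a confirmed diagnosis: write() serializes as XML unless told otherwise, while lxml.html.tostring() defaults to HTML serialization, which leaves script contents unescaped. If that is the cause, passing method="html" to write() should have the same effect. The sample document and the serialize helper below are hypothetical.

```python
import io
import lxml.html

# Hypothetical input document, not taken from the original post.
doc = lxml.html.fromstring(
    "<html><head><script></script></head><body></body></html>"
)
doc.find(".//script").text = "<!--\nif (a < b) { run(); }\n-->"
tree = doc.getroottree()


def serialize(**kwargs):
    """Write the tree to an in-memory buffer and return the bytes."""
    buf = io.BytesIO()
    tree.write(buf, **kwargs)
    return buf.getvalue()


xml_out = serialize()                    # write() defaults to method="xml"
html_out = serialize(method="html")      # HTML serialization
tostring_out = lxml.html.tostring(doc)   # defaults to method="html"

print(xml_out.decode())
print(html_out.decode())
```

Comparing the two printed documents shows whether the script body was escaped in each mode.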
