Prevent lxml from touching the contents of <script> tags
I'm trying to write a Python script that modifies the contents of the <script> tags in files I'm parsing. I'm using lxml.html (as opposed to BeautifulSoup, etc.) for its speed. The contents of each script tag are wrapped in comment markers (<!-- and -->):
<script>
<!--
...
-->
</script>
The problem is when I try something like scriptNode.text = '<!-- ...
lxml converts the angle brackets to their HTML entities (&lt; and &gt;) when I write the HTML back to a file. I tried escaping them in the string ('\< ...'), but that doesn't help.
Looking at most modern websites, it seems those comment markers are not needed. I can remove them, but many of the scripts also contain some HTML, and if that gets converted to entities as well, that's a problem.
I'm surprised that lxml is modifying this data at all; last I heard, HTML parsers are designed to avoid modifying or interpreting data within <script> tags.
Is there a setting/command I can use to prevent this from happening?
Thanks
Put them in a CDATA section.
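As a rough sketch of what this might look like with lxml: the library provides etree.CDATA, which can be assigned to an element's text. The input document and the script body below are made up for illustration, and the sketch assumes the output is serialized with the default XML method, which is where the escaping would otherwise show up.

```python
from lxml import etree, html

# Hypothetical input document, for illustration only.
doc = html.fromstring(
    "<html><head><script></script></head><body><p>hi</p></body></html>"
)
script = doc.find(".//script")

# Wrap the new script body in a CDATA section so the XML serializer
# writes it out verbatim instead of escaping '<' and '>'.
js = "\nif (a < b) { document.write('<b>x</b>'); }\n"
script.text = etree.CDATA(js)

as_xml = etree.tostring(doc, encoding="unicode")
print(as_xml)
```

Whether the literal CDATA markers are acceptable inside the script depends on how the resulting page is parsed, so it is worth checking the output in the target environment.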
An alternative solution I just found that seems to work as well is using tostring() instead of write():
instead of
Figured I'd post it here as well, even though I decided to go with CDATA, since it seems to be a cleaner solution that will also prevent problems with other parsing scripts in the future.
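A minimal sketch of the difference between the two calls, under the assumption that it comes down to the serialization method: ElementTree.write() defaults to method="xml", which escapes text content, while lxml.html.tostring() defaults to method="html", which leaves <script> content alone. The document and script body here are invented for illustration and are not the snippet from this answer.

```python
import io
from lxml import html

# Hypothetical input document, for illustration only.
doc = html.fromstring(
    "<html><head><script></script></head><body><p>hi</p></body></html>"
)
script = doc.find(".//script")
script.text = "<!--\nif (a < b) { document.write('<b>x</b>'); }\n-->"

# write() serializes as XML unless told otherwise, so '<' becomes '&lt;'.
xml_buf = io.BytesIO()
doc.getroottree().write(xml_buf)
written_as_xml = xml_buf.getvalue().decode()

# lxml.html.tostring() uses the HTML serializer, which keeps script text raw.
via_tostring = html.tostring(doc, encoding="unicode")

# write() can also be asked for HTML output explicitly.
html_buf = io.BytesIO()
doc.getroottree().write(html_buf, method="html")
written_as_html = html_buf.getvalue().decode()
```

If this assumption holds, passing method="html" to write() would be another way to keep the script body untouched without switching to tostring().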