® converted to Â® when parsing XML in Python
My RSS feed contains:
<title><![CDATA[HBO Wins 19 Emmy® Awards, The Most of Any Network This Year]]></title>
I am parsing the RSS and assigning each item's title to title as below:
for item in XML.ElementFromURL(feed).xpath('//item', namespaces=NEWS_NS):
    title = item.find('title').text
    Log("Title :" + title)
When I check the output or the log file, I see the title as below:
HBO Wins 19 EmmyÂ® Awards, The Most of Any Network This Year.
® gets converted to Â®. I tried using an HTML parser, but it did not help.
You state that the encoding of the feed is ISO-8859-1.
In that case, if the bytes that you say should be interpreted as ® are in fact C2 AE, then the text really, truly is EmmyÂ® Awards, and everything is working as it should. If the sender intended different text, they would have sent different data or set the encoding differently.
If the encoding of the feed were UTF-8, and the bytes sent over the wire were still C2 AE, then the text would be Emmy® Awards.
If the encoding of the feed were ISO-8859-1, and the bytes sent over the wire were simply AE, with no C2, then the text would be Emmy® Awards.
To be sure what the bytes are, use the od -x command in Unix or the d command in debug.exe for Windows. Don't trust Notepad in situations like this. It lies.
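The three cases described in this answer can be checked directly in Python 3. The snippet below is only a sketch: the byte strings are hand-written stand-ins for what the feed might send, and hex_dump is a hypothetical helper that plays the role of od or debug.exe.

```python
# Hand-written sample bytes (assumptions, not captured from the real feed).
wire_two_bytes = b"Emmy\xc2\xae Awards"  # C2 AE: UTF-8 encoding of U+00AE
wire_one_byte = b"Emmy\xae Awards"       # AE: ISO-8859-1 encoding of U+00AE

# Case 1: bytes C2 AE read as ISO-8859-1 give two characters, Â and ®.
case1 = wire_two_bytes.decode("iso-8859-1")

# Case 2: bytes C2 AE read as UTF-8 give the single character ®.
case2 = wire_two_bytes.decode("utf-8")

# Case 3: byte AE read as ISO-8859-1 also gives the single character ®.
case3 = wire_one_byte.decode("iso-8859-1")


def hex_dump(data):
    """Hypothetical helper: show raw bytes as space-separated hex pairs."""
    return " ".join("{:02x}".format(b) for b in bytearray(data))


print(hex_dump(wire_two_bytes))
print(repr(case1), repr(case2), repr(case3))
```

Looking at the hex output rather than at rendered text avoids relying on whatever encoding the viewer happens to assume.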
You've received some text encoded using UTF-8, but at some point those bytes are being incorrectly interpreted as ISO-8859-1 or another encoding instead.
Without more context, it's difficult to tell exactly where the mistake is taking place. You should first check the encoding being used to read your log file.
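One way to test the log-file theory is to read the same file with two different encodings and compare the results. This is a minimal Python 3 sketch; read_log_as and demo are hypothetical names, and the temporary file stands in for the real log, whose location and encoding are not given in the question.

```python
import io
import os
import tempfile


def read_log_as(path, encoding):
    """Read a text file using an explicitly chosen encoding."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def demo():
    """Write a UTF-8 log line, then read it back two ways."""
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    try:
        with io.open(path, "w", encoding="utf-8") as handle:
            handle.write(u"Title :HBO Wins 19 Emmy\u00ae Awards")
        as_utf8 = read_log_as(path, "utf-8")         # correct reading
        as_latin1 = read_log_as(path, "iso-8859-1")  # wrong reading
        return as_utf8, as_latin1
    finally:
        os.remove(path)


if __name__ == "__main__":
    good, bad = demo()
    print(repr(good))
    print(repr(bad))
```

If the real log looks right when read as UTF-8 and wrong when read as ISO-8859-1, the data is probably fine and the viewer is the part that is misinterpreting it.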
I tried the following and it worked:
The string I get has been converted to UTF-8 (® becomes Â®), so I convert it back to ISO-8859-1 (Â® becomes ®) and get the correct output.
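The round trip described above might look like the following in Python 3. This is a sketch under the assumption that the title arrived as UTF-8 bytes that were wrongly decoded as ISO-8859-1; repair_mojibake is a hypothetical helper name, and the poster's actual code (probably Python 2) is not shown in the thread.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-ISO-8859-1 mix-up, if that is what happened.

    Re-encoding as ISO-8859-1 recovers the original bytes, which are then
    decoded as UTF-8. Text that does not fit this pattern is returned
    unchanged.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    broken = u"HBO Wins 19 Emmy\u00c2\u00ae Awards"
    print(repr(repair_mojibake(broken)))
```

This works around the symptom after the fact. The earlier answers suggest the more durable fix is to make the feed's declared encoding match the bytes it actually sends.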