® gets converted to Â® when parsing XML in Python

Posted on 2024-12-06 20:21:35

My RSS feed contains:

<title><![CDATA[HBO Wins 19 Emmy® Awards, The Most of Any Network This Year]]></title>

Now I am parsing RSS and then assigning the title to title as below:

    # XML, Log, feed and NEWS_NS are defined elsewhere (not shown in the question)
    for item in XML.ElementFromURL(feed).xpath('//item', namespaces=NEWS_NS):
        title = item.find('title').text
        Log("Title :" + title)

When I check the output or the log file, I see the title as below:

HBO Wins 19 EmmyÂ® Awards, The Most of Any Network This Year.

® gets converted to Â®. I tried using an HTML parser, but that did not help.

Comments (3)

双马尾 2024-12-13 20:21:35

You state that the encoding of the feed is ISO-8859-1.

In that case, if the bytes that you say should be interpreted as ® are in fact C2 AE, then the text really, truly is EmmyÂ® Awards, and everything is working as it should. If the sender intended different text, they would have sent different data or set the encoding differently.

If the encoding of the feed were UTF-8, and the bytes sent over the wire were still C2 AE, then the text would be Emmy® Awards.

If the encoding of the feed were ISO-8859-1, and the bytes sent over the wire were simply AE, with no C2, then the text would be Emmy® Awards.
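The three cases above can be reproduced with a minimal Python 3 sketch. The byte strings are made up for illustration (they are not taken from the actual feed), and the question's code appears to be Python 2, so this only shows how the same bytes decode under each declared encoding:

```python
# Hypothetical wire bytes, assumed for illustration only.
utf8_bytes = b"Emmy\xc2\xae Awards"    # registered sign sent as C2 AE
latin1_bytes = b"Emmy\xae Awards"      # registered sign sent as AE alone

# Case 1: feed declared ISO-8859-1, bytes are C2 AE -> two characters, Â and ®
case1 = utf8_bytes.decode("iso-8859-1")

# Case 2: feed declared UTF-8, bytes are C2 AE -> a single ®
case2 = utf8_bytes.decode("utf-8")

# Case 3: feed declared ISO-8859-1, byte is AE alone -> a single ®
case3 = latin1_bytes.decode("iso-8859-1")

# ascii() escapes non-ASCII characters, so the output is safe on any console.
for name, text in (("case1", case1), ("case2", case2), ("case3", case3)):
    print(name, ascii(text), len(text))
```

Only the first case produces the extra Â, which matches the title seen in the log.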

To be sure what the bytes are, use the od -x command in Unix or the d command in debug.exe for Windows. Don't trust Notepad in situations like this. It lies.
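If neither tool is at hand, the same check can be sketched in Python 3. The helper name `hex_bytes` and the sample bytes are hypothetical; in practice the bytes would come from the raw feed response or from a file opened in binary mode:

```python
def hex_bytes(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


# Stand-in for bytes read from the feed (assumed, not the real data).
raw = b"Emmy\xc2\xae Awards"

print(hex_bytes(raw))
# Look for "c2 ae" (UTF-8 encoding of the sign) versus a lone "ae" (ISO-8859-1).
```

Because this prints raw byte values rather than decoded characters, it is not affected by whatever encoding the editor or terminal assumes.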

风向决定发型 2024-12-13 20:21:35

You've received some text encoded using UTF-8, but at some point those bytes are being incorrectly interpreted as ISO-8859-1 or another encoding instead.

Without more context, it's difficult to tell exactly where the mistake is taking place. You should first check the encoding being used to read your log file.
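As a rough illustration of how the encoding used to read a log file changes what you see, here is a small Python 3 sketch. The file name `plugin.log` and the title string are assumptions made for the example; the log is written as UTF-8 and then read back twice:

```python
import os
import tempfile

title = "HBO Wins 19 Emmy\u00ae Awards"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "plugin.log")  # hypothetical log file

    # Write the log as UTF-8: the sign is stored as the bytes C2 AE.
    with open(path, "w", encoding="utf-8") as f:
        f.write(title)

    # Reading with the matching encoding gives the original text back.
    with open(path, encoding="utf-8") as f:
        read_as_utf8 = f.read()

    # Reading the same bytes as ISO-8859-1 shows the stray extra character.
    with open(path, encoding="iso-8859-1") as f:
        read_as_latin1 = f.read()

print(ascii(read_as_utf8))
print(ascii(read_as_latin1))
```

If the log viewer behaves like the second read, the data itself may be fine and only the display is wrong.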

樱桃奶球 2024-12-13 20:21:35

I tried the following and it worked:

title = item.find('title').text
title = title.encode('iso-8859-1')

The string I get from the parser has ® turned into Â®, which is what the UTF-8 bytes look like when they are decoded as ISO-8859-1. Encoding the string back to ISO-8859-1 reverses that step (Â® back to ®), and I get the correct output.
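In Python 3 the same idea needs one more step, because `encode` returns `bytes` that then have to be decoded as UTF-8. The sketch below uses a hypothetical helper name, `fix_mojibake`, and assumes the text really was UTF-8 misread as ISO-8859-1; if the round trip fails, it returns the input unchanged instead of raising:

```python
def fix_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-ISO-8859-1 mix-up, if that is what happened."""
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in ISO-8859-1, or not valid UTF-8: leave it alone.
        return text


garbled = "HBO Wins 19 Emmy\u00c2\u00ae Awards"
print(ascii(fix_mojibake(garbled)))
```

This treats the symptom after parsing. If the feed's declared encoding does not match its bytes, handing the parser correctly decoded input is likely the more robust fix.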
