Reading an RSS feed with Linq-to-XML and C# - how to decode a CDATA section?
I am trying to read an RSS feed using C# and Linq to XML.
The feed is encoded in utf-8 (see http://pc03224.kr.hsnr.de/infosys/feed/) and reading it out generally works fine except for the description node because it is enclosed in a CDATA section.
For some reason I can't see the CDATA tag in the debugger after reading out the content of the "description" tag, but I guess it must be there somewhere, because only in this section are the German umlauts (äöü) and other special characters not shown correctly. Instead they remain in the string in encoded form, like &uuml;.
Can I somehow read them out correctly or at least decode them afterwards?
This is a sample of the RSS section giving me troubles:
<description><![CDATA[blabla bietet Hörern meiner Vorlesungen “IAS”, “WEB” und “SWE” an, Lizenzen für blabla [...]]]></description>
Here is my code which reads out and parses the RSS feed data:
RssItems = (from xElem in xml.Descendants("channel").Descendants("item")
            select new RssItem
            {
                Content = xElem.Descendants("description").FirstOrDefault().Value,
                ...
            }).ToList();
Thanks in advance!
Comments (1)
Your code is working as intended. A CDATA section means that its contents should not be interpreted, i.e. "&ouml;" should not be treated as an HTML entity but just as a literal sequence of characters. Contact the author of the RSS feed and tell him to fix it, either by removing the CDATA tags so the entities get interpreted, or by putting the intended characters directly into the feed.
Alternatively, have a look at HttpUtility.HtmlDecode to decode the CDATA contents a second time.
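As a rough illustration of that second decoding pass, here is a minimal sketch. It assumes the feed carries HTML entities such as &ouml; inside the CDATA section; the RssDemo class, the ReadDescriptions helper and the SampleFeed string are hypothetical names made up for this example. It also swaps in WebUtility.HtmlDecode from System.Net for HttpUtility.HtmlDecode, since it decodes entities the same way for this case without needing a reference to System.Web.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Xml.Linq;

public static class RssDemo
{
    // Hypothetical sample feed: the description holds HTML entities inside CDATA.
    public const string SampleFeed =
        "<rss version=\"2.0\"><channel><item>" +
        "<description><![CDATA[H&ouml;rern f&uuml;r]]></description>" +
        "</item></channel></rss>";

    public static List<string> ReadDescriptions(string feedXml)
    {
        XDocument xml = XDocument.Parse(feedXml);

        return (from xElem in xml.Descendants("channel").Descendants("item")
                // The string cast yields null instead of throwing when the element is missing.
                let raw = (string)xElem.Element("description")
                // The parser hands CDATA content over verbatim, so decode the entities afterwards.
                select WebUtility.HtmlDecode(raw ?? string.Empty)).ToList();
    }

    public static void Main()
    {
        foreach (string text in ReadDescriptions(SampleFeed))
        {
            Console.WriteLine(text);
        }
    }
}
```

With the sample feed above, ReadDescriptions returns a single string in which &ouml; and &uuml; have been replaced by ö and ü. Decoding on the client only works around the problem; if the feed is later corrected, the extra pass becomes unnecessary and could alter text that legitimately contains an ampersand sequence.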