How do I read a secure RSS feed into a SyndicationFeed without providing credentials?
For whatever reason, IBM uses https (without requiring credentials) for their RSS feeds. I'm trying to consume https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en with a .NET 4 SyndicationFeed. I can open this feed in a browser and it loads just fine. Here's the code:
using (XmlReader xml = XmlReader.Create("https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en"))
{
    var items = from item in SyndicationFeed.Load(xml).Items
                select item;
}
Here's the exception:
System.Net.WebException was unhandled by user code
Message=The remote server returned an error: (500) Internal Server Error.
Source=System
StackTrace:
at System.Net.HttpWebRequest.GetResponse()
at System.Xml.XmlDownloadManager.GetNonFileStream(Uri uri, ICredentials credentials, IWebProxy proxy, RequestCachePolicy cachePolicy)
at System.Xml.XmlDownloadManager.GetStream(Uri uri, ICredentials credentials, IWebProxy proxy, RequestCachePolicy cachePolicy)
at System.Xml.XmlUrlResolver.GetEntity(Uri absoluteUri, String role, Type ofObjectToReturn)
at System.Xml.XmlReaderSettings.CreateReader(String inputUri, XmlParserContext inputContext)
at System.Xml.XmlReader.Create(String inputUri, XmlReaderSettings settings, XmlParserContext inputContext)
at System.Xml.XmlReader.Create(String inputUri)
at EDN.Util.Test.FeedAggTest.LoadFeedInfoTest() in D:\cdn\trunk\CDN\Dev\Shared\net\EDN.Util\EDN.Util.Test\FeedAggTest.cs:line 126
How do I configure the reader to work with an https feed?
I don't think it has anything to do with security. A 500 error is a server-side error. Something in the request generated by XmlReader.Create(url) is confusing the IBM website. If it were simply a security issue, as suggested in your question, you would expect a 403 error, or "Authorization Denied". But you got a 500, which is an application error.
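To check which status the server actually returned, the WebException can be inspected directly. The following is only a minimal sketch: the class and method names (FeedDiagnostics, Describe) are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Net;

static class FeedDiagnostics
{
    // Hypothetical helper: summarizes a WebException in one line.
    public static string Describe(WebException ex)
    {
        // Response is null when the failure happened before any HTTP
        // response arrived (DNS failure, refused connection, ...).
        var response = ex.Response as HttpWebResponse;
        if (response == null)
            return "No HTTP response: " + ex.Status;

        // For a 500 this shows the numeric code, the reason phrase,
        // and any Set-Cookie header the server attached to the error.
        return string.Format("{0} {1}; Set-Cookie: {2}",
            (int)response.StatusCode,
            response.StatusDescription,
            response.Headers["Set-Cookie"] ?? "(none)");
    }
}
```

Wrapping the XmlReader.Create(url) call in a try block and passing the caught WebException to Describe shows whether the failure is a 4xx authorization problem or a 5xx server error.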
Even so, maybe there's something the client app can do, to avoid confusing the server.
I looked at the outgoing HTTP request headers, using Fiddler. For a request generated by IE, the headers look like this:
For a request from XmlReader.Create(url), the headers look like this:
Quite a difference. Also, the 500 response to the latter carried a Set-Cookie header that wasn't present in the response to IE. Based on that, I theorized that the difference in request headers, in particular the cookie, was what confused ibm.com.
I don't know how to convince XmlReader.Create() to embed all the request headers I wanted, including the cookie. But I know how to do that with an HttpWebRequest. So I used that.
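As a rough sketch of that idea, an HttpWebRequest can be given browser-like headers and a cookie container before it is sent. The header values below are placeholders chosen for illustration, not the headers captured in Fiddler, and FeedRequest is a hypothetical name.

```csharp
using System;
using System.Net;

static class FeedRequest
{
    // Hypothetical factory: builds a GET request that carries explicit
    // headers and cookies. Nothing is sent until GetResponse() is called.
    public static HttpWebRequest Create(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";

        // Accept and User-Agent are restricted headers, so they are set
        // through their dedicated properties. The values are assumptions.
        request.Accept = "application/rss+xml, application/xml;q=0.9, */*;q=0.8";
        request.UserAgent = "Mozilla/5.0 (compatible; ExampleFeedReader/1.0)";

        // Unrestricted headers can go through the Headers collection.
        request.Headers["Accept-Language"] = "en-US";

        // Cookies in the container that match the URL are sent with the request.
        request.CookieContainer = cookies ?? new CookieContainer();
        return request;
    }
}
```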
There were a few hurdles I had to clear.
I needed the persistent cookie for ibm.com. For that I had to resort to a p/invoke of the Win32 InternetGetCookie. See the PersistentCookies class attached in the user-contributed content at the bottom of the doc page for WebRequest, for how to do that. After attaching the cookie, I was no longer getting 500 errors. Hooray!
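This is not the PersistentCookies class mentioned above, only a minimal sketch of the same technique. It assumes Windows (wininet.dll) and that the cookie values contain no commas, since the semicolon-separated string is turned into the comma-separated form that CookieContainer.SetCookies expects. The class name IeCookies is hypothetical.

```csharp
using System;
using System.Net;
using System.Runtime.InteropServices;
using System.Text;

static class IeCookies
{
    // Win32 API from wininet.dll; only callable on Windows.
    [DllImport("wininet.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool InternetGetCookie(
        string url, string cookieName, StringBuilder cookieData, ref int size);

    // Returns the persistent cookies for the URL as "a=1; b=2",
    // or an empty string when none could be read.
    public static string ReadRaw(string url)
    {
        int size = 1024;
        var buffer = new StringBuilder(size);
        if (InternetGetCookie(url, null, buffer, ref size))
            return buffer.ToString();

        // If the call failed, size may hold the required buffer size,
        // so retry once with that capacity.
        if (size <= 0) return string.Empty;
        buffer = new StringBuilder(size);
        return InternetGetCookie(url, null, buffer, ref size)
            ? buffer.ToString()
            : string.Empty;
    }

    // Converts the raw cookie string into a CookieContainer for the URI.
    public static CookieContainer ToContainer(Uri uri, string rawCookies)
    {
        var container = new CookieContainer();
        if (!string.IsNullOrEmpty(rawCookies))
            container.SetCookies(uri, rawCookies.Replace(';', ','));
        return container;
    }
}
```

The resulting container can be assigned to the CookieContainer property of the HttpWebRequest before the request is sent.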
But the resulting stream could not be read by XmlReader.Create(). It looked binary to me. I realized I needed to decompress the gzip or deflate content. Rather than wrap a GZipStream or DeflateStream around the received response stream myself, I set the AutomaticDecompression property on HttpWebRequest. I could have avoided the need for this by not including "gzip, deflate" in the Accept-Encoding header of the outbound request. Actually, once the AutomaticDecompression property is set, that header is added implicitly to the outbound HTTP request.
When I did that, I got actual text. But some of the byte codes were off. Next I needed to use the proper text encoding in the TextReader, as indicated in the HttpWebResponse.
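A minimal sketch of those two steps follows, assuming the charset reported by the response is usable and falling back to UTF-8 when it is missing or unknown (the fallback is my assumption, not something the original code is known to do). FeedDownload and its members are hypothetical names.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class FeedDownload
{
    // Maps the response charset to an Encoding, defaulting to UTF-8.
    public static Encoding PickEncoding(string charset)
    {
        if (string.IsNullOrEmpty(charset)) return Encoding.UTF8;
        try { return Encoding.GetEncoding(charset.Trim('"', ' ')); }
        catch (ArgumentException) { return Encoding.UTF8; }
    }

    // Sends the request and returns the decompressed, decoded body.
    public static string ReadBody(HttpWebRequest request)
    {
        // Lets the framework undo gzip/deflate on the response stream;
        // it also adds the matching Accept-Encoding request header.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream, PickEncoding(response.CharacterSet)))
        {
            return reader.ReadToEnd();
        }
    }
}
```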
After doing that, I got a sensible string, but the resulting decompressed rss stream caused the XmlReader to choke, with
ReadElementString method can only be called on elements with simple or empty content. Line 11, position 25.
I looked and found a small <script> block at that location, within the <copyright> element of the RSS document. It seems IBM is trying to get the browser to "localize" the copyright date by attaching logic that would run in the browser to format the date. Seems like overkill to me, or even a bug on IBM's part. But because the angle brackets within the text node of an element bothered the XmlReader, I removed the script block with a Regex replace.
After clearing those hurdles, it worked. The .NET app was able to read the RSS stream from that https URL.
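The script removal can be sketched roughly as below. The pattern is my own guess at a workable expression, not the one used in the original code, and FeedCleanup is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

static class FeedCleanup
{
    // Matches a whole <script ...>...</script> block, across line breaks
    // and regardless of tag case. Lazy matching stops at the first close tag.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>.*?</script>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the feed text with every script block removed.
    public static string StripScripts(string xml)
    {
        return ScriptBlock.Replace(xml, string.Empty);
    }
}
```

The cleaned string can then be handed to the feed parser, for example with SyndicationFeed.Load(XmlReader.Create(new StringReader(cleaned))).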
I didn't do any further testing to see whether varying the Accept header or the Accept-Encoding header would change the behavior. That's for you to figure out, if you care.
The resulting code is below. It's much uglier than your simple 3-liner. I don't know how to make it any simpler.