How do I read a secure RSS feed into a SyndicationFeed without providing credentials?

Posted on 2024-08-31 19:20:59

For whatever reason, IBM uses https (without requiring credentials) for their RSS feeds. I'm trying to consume https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en with a .NET 4 SyndicationFeed. I can open this feed in a browser and it loads just fine. Here's the code:

        using (XmlReader xml = XmlReader.Create("https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en"))
        {
            var items = from item in SyndicationFeed.Load(xml).Items
                        select item;
        }

Here's the exception:

System.Net.WebException was unhandled by user code
Message=The remote server returned an error: (500) Internal Server Error.
Source=System
StackTrace:
   at System.Net.HttpWebRequest.GetResponse()
   at System.Xml.XmlDownloadManager.GetNonFileStream(Uri uri, ICredentials credentials, IWebProxy proxy, RequestCachePolicy cachePolicy)
   at System.Xml.XmlDownloadManager.GetStream(Uri uri, ICredentials credentials, IWebProxy proxy, RequestCachePolicy cachePolicy)
   at System.Xml.XmlUrlResolver.GetEntity(Uri absoluteUri, String role, Type ofObjectToReturn)
   at System.Xml.XmlReaderSettings.CreateReader(String inputUri, XmlParserContext inputContext)
   at System.Xml.XmlReader.Create(String inputUri, XmlReaderSettings settings, XmlParserContext inputContext)
   at System.Xml.XmlReader.Create(String inputUri)
   at EDN.Util.Test.FeedAggTest.LoadFeedInfoTest() in D:\cdn\trunk\CDN\Dev\Shared\net\EDN.Util\EDN.Util.Test\FeedAggTest.cs:line 126

How do I configure the reader to work with an https feed?

你是年少的欢喜 2024-09-07 19:20:59

I don't think it has anything to do with security. A 500 error is a server-side error. Something in the request generated by XmlReader.Create(url) is confusing the IBM website. If it were simply a security issue, as your question suggests, you'd expect a 401 (Unauthorized) or 403 (Forbidden) response. But you got a 500, which is an application error.
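A minimal sketch of how to confirm this from the client side, assuming you just want to see the status code the server actually returned. The FeedProbe class and its method names are made up for illustration; WebException.Response and HttpWebResponse.StatusCode are the standard System.Net members.

```csharp
using System;
using System.Net;
using System.Xml;

static class FeedProbe
{
    // Rough classification: 401/403 point at authentication or authorization,
    // 5xx points at a failure inside the server application.
    public static string Classify(HttpStatusCode status)
    {
        int code = (int)status;
        if (code == 401 || code == 403) return "auth";
        if (code >= 500 && code <= 599) return "server";
        return "other";
    }

    // Tries to open the URL the same way the question does and reports
    // the HTTP status if the request fails.
    public static void Probe(string url)
    {
        try
        {
            using (XmlReader.Create(url))
            {
                Console.WriteLine("OK");
            }
        }
        catch (WebException ex)
        {
            // Response is null when no HTTP response arrived (DNS failure, timeout, ...).
            var resp = ex.Response as HttpWebResponse;
            if (resp == null)
            {
                Console.WriteLine("No HTTP response: " + ex.Status);
                return;
            }
            Console.WriteLine("{0} {1} -> {2}",
                (int)resp.StatusCode, resp.StatusDescription, Classify(resp.StatusCode));
        }
    }
}
```

Calling FeedProbe.Probe(url) with the feed URL would show whether the failure is an authorization problem or a server-side one.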

Even so, maybe there's something the client app can do, to avoid confusing the server.

I looked at the outgoing HTTP request headers, using Fiddler. For a request generated by IE, the headers look like this:

GET https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en HTTP/1.1
Accept: image/gif, image/jpeg, image/pjpeg, application/x-ms-application, application/vnd.ms-xpsdocument, application/xaml+xml, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-silverlight, application/x-shockwave-flash, application/x-silverlight-2-b2, */*
Accept-Language: en-us
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/4.0; .NET CLR 3.5.30729;)
Accept-Encoding: gzip, deflate
Host: www.ibm.com
Connection: Keep-Alive
Cookie: UnicaNIODID=Ww06gyvyPpZ-WPl6K7y; conxnsCookie=en; IBMPOLLCOOKIE=""; UnicaNIODID=QridYHCNf7M-WYM8Usr

For a request from XmlReader.Create(url), the headers look like this:

GET https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en HTTP/1.1
Host: www.ibm.com
Connection: Keep-Alive

Quite a difference. Also, the 500 response to the XmlReader request carried a Set-Cookie header, which wasn't present in the response to IE.

Based on that I theorized that it was the difference in request headers, in particular the cookie, that was confusing ibm.com.

I don't know how to convince XmlReader.Create() to embed all the request headers I wanted, including the cookie. But I know how to do that with an HttpWebRequest. So I used that.
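As a rough sketch of that idea, here is how a request with browser-like headers can be built and inspected before anything is sent. The BrowserLikeRequest class is hypothetical, and the single cookie added by hand (conxnsCookie=en, taken from the IE capture above) only stands in for whatever cookies the server really expects.

```csharp
using System;
using System.Net;

static class BrowserLikeRequest
{
    // Builds (but does not send) a request that carries the kind of
    // headers a browser would send.
    public static HttpWebRequest Build(string url)
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.Accept = "text/xml, */*";
        req.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)";
        req.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-us");

        // Assumption: one illustrative cookie, added manually.
        req.CookieContainer = new CookieContainer();
        req.CookieContainer.Add(new Cookie("conxnsCookie", "en", "/", "www.ibm.com"));
        return req;
    }
}
```

The Cookie header itself is only generated when the request is sent, so to preview it you can ask the container directly with req.CookieContainer.GetCookieHeader(req.RequestUri). Once the request succeeds, the response stream can be handed to XmlReader.Create(Stream) instead of passing the URL.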

There were a few hurdles I had to clear.

1. I needed the persistent cookie for ibm.com. For that I had to resort to a p/invoke of the Win32 InternetGetCookie function. See the PersistentCookies class attached in the user-contributed content at the bottom of the doc page for WebRequest, for how to do that. After attaching the cookie, I was no longer getting 500 errors. Hooray!
2. But the resulting stream could not be read by XmlReader.Create(). It looked binary to me. I realized I needed to decompress the gzip or deflate content. Rather than wrapping a GZipStream or DeflateStream around the received response stream by hand, I set the AutomaticDecompression property on HttpWebRequest. I could have avoided the need for this by not including "gzip, deflate" in the Accept-Encoding header of the outbound request. Actually, once AutomaticDecompression is set, that header is added implicitly to the outbound HTTP request.
3. When I did that, I got actual text. But some of the byte codes were off. Next I needed to use the proper text encoding in the TextReader, as indicated by the CharacterSet of the HttpWebResponse.
4. After doing that, I got a sensible string, but the resulting decompressed RSS stream caused the XmlReader to choke, with:
   ReadElementString method can only be called on elements with simple or empty content. Line 11, position 25.
   I looked and found a small <script> block, at that location, within the <copyright> element in the RSS document. It seems IBM is trying to get the browser to "localize" the copyright date by attaching logic that would run in the browser to format the date. Seems like overkill to me, or even a bug by IBM. But because the angle bracket within the text node of an element bothered the XmlReader, I removed the script block with a Regex replace.
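The last two hurdles (picking the text encoding and stripping the script block) can be sketched in isolation, without any network access. FeedCleanup and its method names are hypothetical, and the sample string in the usage note is invented to resemble the element described above, not copied from the real feed.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class FeedCleanup
{
    // Chooses the encoding named by the response's charset,
    // falling back to UTF-8 when it is missing or unknown.
    public static Encoding PickEncoding(string characterSet)
    {
        if (string.IsNullOrEmpty(characterSet)) return Encoding.UTF8;
        try
        {
            // Some servers quote the charset value; trim quotes and spaces.
            return Encoding.GetEncoding(characterSet.Trim('"', ' '));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Removes every <script>...</script> block so that elements such as
    // <copyright> contain plain text again.
    public static string StripScripts(string xml)
    {
        return Regex.Replace(
            xml,
            @"<script\b[^>]*>.*?</script>",
            "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}
```

For example, FeedCleanup.StripScripts("<copyright>Copyright <script type='text/javascript'>document.write(2010);</script> IBM</copyright>") leaves only the text inside the copyright element, which XmlReader can then treat as simple content.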

After clearing those hurdles, it worked. The .NET app was able to read the RSS stream from that https url.

I didn't do any further testing - to see if varying the Accept header or the Accept-Encoding header would change the behavior. That's for you to figure out, if you care.

The resulting code is below. It's much uglier than your simple 3-liner. I don't know how to make it any simpler.

using System.IO;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public void Run()
{
    string url = "https://www.ibm.com/developerworks/mydeveloperworks/blogs/roller-ui/rendering/feed/gradybooch/entries/rss?lang=en";

    HttpWebRequest hwr = (HttpWebRequest) WebRequest.Create(url);
    // attach persistent cookies (PersistentCookies is the helper class
    // mentioned above; it is not part of the .NET Framework)
    hwr.CookieContainer =
        PersistentCookies.GetCookieContainerForUrl(url);
    hwr.Accept = "text/xml, */*";
    hwr.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-us");
    hwr.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; .NET CLR 3.5.30729;)";
    hwr.KeepAlive = true;
    // also adds "Accept-Encoding: gzip, deflate" to the outbound request
    hwr.AutomaticDecompression = DecompressionMethods.Deflate |
                                 DecompressionMethods.GZip;

    using (var resp = (HttpWebResponse) hwr.GetResponse())
    {
        using (Stream s = resp.GetResponseStream())
        {
            // decode with the charset the server declared, defaulting to UTF-8
            string cs = String.IsNullOrEmpty(resp.CharacterSet) ? "UTF-8" : resp.CharacterSet;
            Encoding e = Encoding.GetEncoding(cs);

            using (StreamReader sr = new StreamReader(s, e))
            {
                var allXml = sr.ReadToEnd();

                // remove all script blocks - they confuse XmlReader
                allXml = Regex.Replace( allXml,
                                        @"<script\b[^>]*>.*?</script>",
                                        "",
                                        RegexOptions.Singleline | RegexOptions.IgnoreCase);

                using (XmlReader xmlr = XmlReader.Create(new StringReader(allXml)))
                {
                    // SyndicationFeed.Load runs here, while the reader is still open
                    var items = from item in SyndicationFeed.Load(xmlr).Items
                        select item;
                }
            }
        }
    }
}
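The code above depends on a PersistentCookies helper that is not shown. What follows is only a guess at what such a helper might look like, not the class from the documentation page: it p/invokes the WinINet InternetGetCookie function, so it runs on Windows only, and the FromCookieString method name is my own. It also assumes no cookie value contains a comma, since CookieContainer.SetCookies uses commas as separators.

```csharp
using System;
using System.Net;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical stand-in for the PersistentCookies helper; Windows only.
static class PersistentCookies
{
    [DllImport("wininet.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool InternetGetCookie(
        string url, string cookieName, StringBuilder cookieData, ref int size);

    // Turns a "name=value; name2=value2" string into a CookieContainer.
    public static CookieContainer FromCookieString(Uri uri, string cookieData)
    {
        var container = new CookieContainer();
        if (!string.IsNullOrEmpty(cookieData))
        {
            // WinINet separates cookies with ';', SetCookies expects ','.
            container.SetCookies(uri, cookieData.Replace(';', ','));
        }
        return container;
    }

    // Reads the cookies stored by WinINet (and therefore IE) for the URL.
    public static CookieContainer GetCookieContainerForUrl(string url)
    {
        int size = 0;
        // A first call with a null buffer asks WinINet how much space is needed.
        if (!InternetGetCookie(url, null, null, ref size) || size <= 0)
            return new CookieContainer();   // no stored cookies for this URL

        var buffer = new StringBuilder(size);
        if (!InternetGetCookie(url, null, buffer, ref size))
            return new CookieContainer();

        return FromCookieString(new Uri(url), buffer.ToString());
    }
}
```

Keeping the string parsing in its own method means that part can be exercised on any platform, while only GetCookieContainerForUrl touches wininet.dll.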
