How to detect whether a page is an RSS or Atom feed

Posted on 2024-08-25 06:28:36

I'm currently building a new online feed reader in PHP. One of the features I'm working on is feed auto-discovery. If a user enters a website URL, the script will detect that it's not a feed and look for the real feed URL by parsing the HTML for the proper <link> tag.

The problem is that the way I currently detect whether the URL is a feed or a website only works part of the time, and I know it can't be the best solution. Right now I'm taking the cURL response and running it through simplexml_load_string; if that can't parse it, I treat it as a website. Here is the code.

$xml = @simplexml_load_string( $site_found['content'] );

if( !$xml ) // this is a website, not a feed
{
    // handle website
}
else
{
    // parse feed
}

Obviously, this isn't ideal. Also, when it runs into an HTML website that it can parse, it thinks it's a feed.

Any suggestions on a good way of detecting the difference between a feed or non-feed in PHP?

零度° 2024-09-01 06:28:36

I would sniff for the various unique identifiers those formats have:

Atom: Source

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

RSS 0.90: Source

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">

Netscape RSS 0.91

<rss version="0.91">

And so on (see the second source link for a full overview).

As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively. Plus you won't find those in a valid HTML document.

You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once :)

If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.

What that looks like in the wild, that is, whether feed providers always conform to those rules, is a different question, but you should already be able to recognize a lot of feeds this way.
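The steps above could be sketched roughly as follows. This is only a minimal illustration, not a complete detector: the function name detect_document_type and its return values are made up for this example, and it assumes the SimpleXML extension is available. It also compares against false explicitly, because a SimpleXMLElement with no children can evaluate as false in a plain !$xml check. One known limitation is that a feed embedding literal <html> or <body> markup in a CDATA section would be misclassified as HTML by the regex.

```php
<?php
// Hypothetical helper (name and return values are assumptions for this sketch):
// classify fetched content as 'atom', 'rss', 'html', or 'unknown'.
function detect_document_type(string $content): string
{
    // Step 1: cheap HTML test. A regex tolerates tag soup that an XML parser rejects.
    if (preg_match('/<(?:html|body)[\s>]/i', $content)) {
        return 'html';
    }

    // Step 2: parse as XML, collecting libxml errors instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 3: the parser choked on the input, so fall back to treating it as a website.
    if ($xml === false) {
        return 'html';
    }

    // Step 4: inspect the root element name (getName() returns it without a prefix).
    switch (strtolower($xml->getName())) {
        case 'feed':
            return 'atom';
        case 'rss': // e.g. <rss version="0.91">
        case 'rdf': // RSS 0.90 uses <rdf:RDF> as its root
            return 'rss';
        default:
            return 'unknown';
    }
}
```

With the question's variables, this would be called as detect_document_type($site_found['content']), and a result of 'html' would trigger the <link> auto-discovery step.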

痴梦一场 2024-09-01 06:28:36

I think your best choice is to check the Content-Type header, as I assume that's the way Firefox (or any other browser) does it. Besides, if you think about it, the Content-Type header is how the server tells user agents how to process the response content. Almost any decent HTTP server sends a correct Content-Type header.

Nevertheless, you could try to identify RSS/Atom in the content as a second choice if the first one "fails" (this criterion is up to you).

An additional benefit is that you only need to request the header instead of the entire document, thus saving you bandwidth, time, etc. You can do this with curl like this:

<?php
 $ch = curl_init("http://sample.com/feed");
 // CURLOPT_NOBODY switches the request method from GET (the default) to HEAD,
 // so the server sends only the HTTP headers and no content.
 curl_setopt($ch, CURLOPT_NOBODY, true);
 curl_exec($ch);
 $conType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

 if (is_rss($conType)) { // you need to implement is_rss($conType)
    // TODO
 } elseif (is_html($conType)) { // you need to implement is_html($conType)
    // Search the HTML for an RSS/Atom feed
 } else {
    // Error: page has no RSS/Atom feed
 }
?>
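One possible way to fill in the two placeholder functions is sketched below. The helper media_type and the exact list of accepted types are assumptions made for this example, not part of the answer. The generic XML types (text/xml, application/xml) are included because feeds are often served that way, but a match on those says only "some XML", so inspecting the body afterwards is still advisable.

```php
<?php
// Hypothetical helper: drop parameters such as "; charset=utf-8" and lower-case the rest.
// Accepts null, since curl_getinfo() may report no Content-Type at all.
function media_type($conType): string
{
    $parts = explode(';', (string) $conType);
    return strtolower(trim($parts[0]));
}

// True when the Content-Type is one that feeds are commonly served with.
function is_rss($conType): bool
{
    $feedTypes = [
        'application/rss+xml',
        'application/atom+xml',
        'application/rdf+xml',
        'application/xml', // generic XML: could be a feed, check the body to be sure
        'text/xml',        // generic XML: could be a feed, check the body to be sure
    ];
    return in_array(media_type($conType), $feedTypes, true);
}

// True when the Content-Type indicates an HTML or XHTML page.
function is_html($conType): bool
{
    return in_array(media_type($conType), ['text/html', 'application/xhtml+xml'], true);
}
```

Comparing only the media type, rather than the whole header value, keeps a charset parameter or different letter case from causing a miss.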
