Automatically extracting feed links (Atom, RSS, etc.) from web pages

Published 2024-12-12 03:50:47


Comments (4)

笑咖 2024-12-19 03:50:47

There's feedfinder:

>>> import feedfinder
>>>
>>> feedfinder.feed('scripting.com')
'http://scripting.com/rss.xml'
>>>
>>> feedfinder.feeds('scripting.com')
['http://delong.typepad.com/sdj/atom.xml', 
 'http://delong.typepad.com/sdj/index.rdf', 
 'http://delong.typepad.com/sdj/rss.xml']
>>>
江湖彼岸 2024-12-19 03:50:47

I second waffle paradox's recommendation of Beautiful Soup for parsing the HTML and then extracting the <link rel="alternate"> tags, where the feeds are referenced. The code I usually use:

from BeautifulSoup import BeautifulSoup as parser  # BeautifulSoup 3; with bs4 this would be: from bs4 import BeautifulSoup

def detect_feeds_in_HTML(input_stream):
    """ examines an open text stream with HTML for referenced feeds.

    This is achieved by detecting all ``link`` tags that reference a feed in HTML.

    :param input_stream: an arbitrary opened input stream that has a :func:`read` method.
    :type input_stream: an input stream (e.g. open file or URL)
    :return: a list of tuples ``(url, feed_type)``
    :rtype: ``list(tuple(str, str))``
    """
    # check if really an input stream
    if not hasattr(input_stream, "read"):
        raise TypeError("An opened input *stream* should be given, was %s instead!" % type(input_stream))
    result = []
    # get the textual data (the HTML) from the input stream
    html = parser(input_stream.read())
    # find all <link> tags whose rel attribute is "alternate"
    feed_urls = html.findAll("link", rel="alternate")
    # extract the URL and the advertised feed MIME type
    for feed_link in feed_urls:
        url = feed_link.get("href", None)
        # only keep entries that actually carry a URL
        if url:
            result.append((url, feed_link.get("type", None)))
    return result
半山落雨半山空 2024-12-19 03:50:47

I don't know of any existing library, but Atom or RSS feeds are usually indicated by a <link> tag in the <head> section, like this:

<link rel="alternate" type="application/rss+xml" href="http://link.to/feed">
<link rel="alternate" type="application/atom+xml" href="http://link.to/feed">

A straightforward way would be to download and parse these pages with an HTML parser like lxml.html and get the href attribute of the relevant <link> tags.
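The approach described above can be sketched with only the standard library (`html.parser` standing in for the third-party lxml.html; the function name and sample URLs are illustrative, not from any real site):

```python
from html.parser import HTMLParser

def find_feed_links(html_text):
    """Collect (href, type) pairs from <link rel="alternate"> feed tags."""
    class _LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.found = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            # keep only <link rel="alternate"> tags advertising a feed type
            if (tag == "link"
                    and a.get("rel") == "alternate"
                    and a.get("type", "").endswith(("rss+xml", "atom+xml"))
                    and a.get("href")):
                self.found.append((a["href"], a["type"]))

    collector = _LinkCollector()
    collector.feed(html_text)
    return collector.found

page = """<html><head>
<link rel="alternate" type="application/rss+xml" href="http://link.to/feed.rss">
<link rel="alternate" type="application/atom+xml" href="http://link.to/feed.atom">
</head><body></body></html>"""

print(find_feed_links(page))
# -> [('http://link.to/feed.rss', 'application/rss+xml'),
#     ('http://link.to/feed.atom', 'application/atom+xml')]
```

In practice you would fetch the page first (e.g. with urllib) and pass the decoded text in; filtering on the advertised MIME type avoids picking up other rel="alternate" links such as translations.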

安人多梦 2024-12-19 03:50:47

Depending on how well-formed the information in these feeds is (e.g., are all the links in the form http://...? Do you know whether they will all be in href or link tags? Will all the links in the feeds point to other feeds? etc.), I'd recommend anything from a simple regex to a full parsing module for extracting links from the feeds.

As far as parsing modules go, I can only recommend Beautiful Soup. Though even the best parser will only go so far, especially in the case I mentioned above: if you can't guarantee that all links in the data point to other feeds, you will have to do some additional crawling and probing on your own.
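To illustrate the regex end of that spectrum (a deliberately naive sketch; the feed XML below is made up, and real feeds are better handled with a parser), one might pull http(s) URLs out of a feed like this:

```python
import re

# Naive sketch: grab every http(s) URL in a chunk of feed XML with a
# regex. It ignores markup structure entirely, which is exactly the
# trade-off described above -- you cannot tell feed links from
# ordinary links without further probing.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

feed_xml = """<rss><channel>
<link>http://example.com/blog</link>
<item><link>http://example.com/other.xml</link></item>
</channel></rss>"""

links = URL_RE.findall(feed_xml)
print(links)  # every http(s) URL, whether or not it points at a feed
```

Each extracted URL would then need to be fetched and sniffed to decide whether it is actually a feed.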
