Automatically extracting feed links (Atom, RSS, etc.) from web pages

Published 2024-12-12 03:50:47


Comments (4)

笑咖 2024-12-19 03:50:47

There's feedfinder:

>>> import feedfinder
>>>
>>> feedfinder.feed('scripting.com')
'http://scripting.com/rss.xml'
>>>
>>> feedfinder.feeds('scripting.com')
['http://delong.typepad.com/sdj/atom.xml', 
 'http://delong.typepad.com/sdj/index.rdf', 
 'http://delong.typepad.com/sdj/rss.xml']
>>>
江湖彼岸 2024-12-19 03:50:47

I second waffle paradox's recommendation of Beautiful Soup for parsing the HTML and then extracting the <link rel="alternate"> tags, where the feeds are referenced. The code I usually use:

from BeautifulSoup import BeautifulSoup as parser  # BeautifulSoup 3; with bs4 this would be: from bs4 import BeautifulSoup

def detect_feeds_in_HTML(input_stream):
    """ examines an open text stream with HTML for referenced feeds.

    This is achieved by detecting all ``link`` tags that reference a feed in HTML.

    :param input_stream: an arbitrary opened input stream that has a :func:`read` method.
    :type input_stream: an input stream (e.g. open file or URL)
    :return: a list of tuples ``(url, feed_type)``
    :rtype: ``list(tuple(str, str))``
    """
    # check if really an input stream
    if not hasattr(input_stream, "read"):
        raise TypeError("An opened input *stream* should be given, was %s instead!" % type(input_stream))
    result = []
    # get the textual data (the HTML) from the input stream
    html = parser(input_stream.read())
    # find all <link> tags whose rel attribute is "alternate"
    feed_urls = html.findAll("link", rel="alternate")
    # extract the URL and the advertised feed MIME type
    for feed_link in feed_urls:
        url = feed_link.get("href", None)
        # only keep entries that actually carry a URL
        if url:
            result.append((url, feed_link.get("type", None)))
    return result
半山落雨半山空 2024-12-19 03:50:47

I don't know of any existing library, but Atom or RSS feeds are usually indicated by a <link> tag in the <head> section, like this:

<link rel="alternate" type="application/rss+xml" href="http://link.to/feed">
<link rel="alternate" type="application/atom+xml" href="http://link.to/feed">

A straightforward way would be to download and parse these pages with an HTML parser like lxml.html and get the href attribute of the relevant <link> tags.
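The approach described above can be sketched with only the standard library (`html.parser` standing in for the third-party lxml.html; the function name and sample URLs are illustrative, not from any real site):

```python
from html.parser import HTMLParser

def find_feed_links(html_text):
    """Collect (href, type) pairs from <link rel="alternate"> feed tags."""
    class _LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.found = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            # keep only <link rel="alternate"> tags advertising a feed type
            if (tag == "link"
                    and a.get("rel") == "alternate"
                    and a.get("type", "").endswith(("rss+xml", "atom+xml"))
                    and a.get("href")):
                self.found.append((a["href"], a["type"]))

    collector = _LinkCollector()
    collector.feed(html_text)
    return collector.found

page = """<html><head>
<link rel="alternate" type="application/rss+xml" href="http://link.to/feed.rss">
<link rel="alternate" type="application/atom+xml" href="http://link.to/feed.atom">
</head><body></body></html>"""

print(find_feed_links(page))
# -> [('http://link.to/feed.rss', 'application/rss+xml'),
#     ('http://link.to/feed.atom', 'application/atom+xml')]
```

In practice you would fetch the page first (e.g. with urllib) and pass the decoded text in; filtering on the advertised MIME type avoids picking up other rel="alternate" links such as translations.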

安人多梦 2024-12-19 03:50:47

Depending on how well-formed the information in these feeds is (e.g., are all the links in the form http://...? Do you know whether they will all be in href or link tags? Will all the links in the feeds point to other feeds? etc.), I'd recommend anything from a simple regex to a full parsing module for extracting links from the feeds.

As far as parsing modules go, I can only recommend Beautiful Soup. Though even the best parser will only go so far, especially in the case I mentioned above: if you can't guarantee that all links in the data point to other feeds, you will have to do some additional crawling and probing on your own.
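To illustrate the regex end of that spectrum (a deliberately naive sketch; the feed XML below is made up, and real feeds are better handled with a parser), one might pull http(s) URLs out of a feed like this:

```python
import re

# Naive sketch: grab every http(s) URL in a chunk of feed XML with a
# regex. It ignores markup structure entirely, which is exactly the
# trade-off described above -- you cannot tell feed links from
# ordinary links without further probing.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

feed_xml = """<rss><channel>
<link>http://example.com/blog</link>
<item><link>http://example.com/other.xml</link></item>
</channel></rss>"""

links = URL_RE.findall(feed_xml)
print(links)  # every http(s) URL, whether or not it points at a feed
```

Each extracted URL would then need to be fetched and sniffed to decide whether it is actually a feed.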
