We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Comments (4)
There's feedfinder.
I second waffle paradox in recommending Beautiful Soup for parsing the HTML and then getting the <link rel="alternate"> tags, where the feeds are referenced. The code I usually use:
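A minimal sketch of that approach, assuming the `bs4` package is installed. The function name `find_feed_links`, the set of MIME types, and the sample HTML are illustrative assumptions, not the answerer's own code:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# MIME types commonly used to advertise feeds (an assumption; extend as needed).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def find_feed_links(html, base_url):
    """Return absolute feed URLs advertised via <link rel="alternate"> tags."""
    soup = BeautifulSoup(html, "html.parser")
    feeds = []
    for link in soup.find_all("link", href=True):
        rel = link.get("rel") or []
        if isinstance(rel, str):  # be tolerant if rel comes back as a plain string
            rel = rel.split()
        link_type = (link.get("type") or "").strip().lower()
        if "alternate" in [r.lower() for r in rel] and link_type in FEED_TYPES:
            # Resolve relative hrefs against the page URL.
            feeds.append(urljoin(base_url, link["href"]))
    return feeds


# Hypothetical page used only for illustration.
SAMPLE_HTML = """<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
<link rel="alternate" type="application/atom+xml" href="http://example.com/atom">
</head><body></body></html>"""

if __name__ == "__main__":
    print(find_feed_links(SAMPLE_HTML, "http://example.com/blog/"))
```

In real use you would download the page first and pass its text and URL to the function.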
I don't know of any existing library, but Atom or RSS feeds are usually indicated with a <link> tag in the <head> section of the page.
A straightforward way would be to download these URLs, parse them with an HTML parser like lxml.html, and get the href attribute of the relevant <link> tags.
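For illustration, here is what such a <link> tag typically looks like, with a rough sketch of pulling out its href. The sketch uses the standard library's `html.parser` as a stand-in for lxml.html so it runs without extra packages. The class name `FeedLinkParser`, the helper `discover_feeds`, and the sample page are assumptions made up for this example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page showing the usual feed autodiscovery markup.
SAMPLE_PAGE = """<html><head>
<title>Example blog</title>
<link rel="alternate" type="application/rss+xml" title="RSS" href="/rss.xml">
<link rel="icon" href="/favicon.ico">
</head><body><p>Hello</p></body></html>"""


class FeedLinkParser(HTMLParser):
    """Collects href values of <link rel="alternate"> tags with a feed type."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        kind = attr.get("type", "").lower()
        # Treat RSS and Atom MIME types as feeds (an assumption of this sketch).
        is_feed = kind in ("application/rss+xml", "application/atom+xml")
        if "alternate" in rel and is_feed and attr.get("href"):
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


if __name__ == "__main__":
    print(discover_feeds(SAMPLE_PAGE, "http://example.com/"))
```

The same logic carries over to lxml.html if you prefer it: find the <link> elements and read their href attributes.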
Depending on how well-formed the information in these feeds is (e.g., are all the links in the form of http://.../? Do you know if they will all be in href or link tags? Are all the links in the feeds going to point to other feeds?), I'd recommend anything from a simple regex to a straight-up parsing module to extract links from the feeds.
As far as parsing modules go, I can only recommend Beautiful Soup. Even the best parser will only go so far, though, especially in the case mentioned above: if you can't guarantee that all links in the data point to other feeds, you have to do some additional crawling and probing on your own.