Python - Is there a module that automatically scrapes the content of an article from a webpage?
I know about lxml and BeautifulSoup, but they won't work for my project, because I don't know in advance what the HTML structure of the sites I want to scrape articles from will be. Is there a Python module, similar to Readability, that does a pretty good job of finding the content of an article and returning it?
Comments (3)
This is possible with PhantomJS (C++) or PyPhantomJS (Python).
Both are headless WebKit-based browsers that you can fully control from JavaScript. Because the page is scriptable that way, I find it really easy to do things such as scraping the content of an article.
PyPhantomJS also has a plugin system, so that's definitely a plus. :)
Extracting the real content from a content page cannot be done automatically, at least not with the standard tools. You have to define or identify where the real content is stored (by specifying the related CSS ID or class in your own HTML extraction code).
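As a rough illustration of that manual approach, here is a minimal sketch that uses only Python's standard `html.parser` module to pull the text out of the first element with a given ID or class. The names `ContainerTextExtractor` and `extract_text`, and the sample markup with `id="article"`, are made up for this example; it also assumes reasonably well-formed HTML in which non-void tags are closed, so real pages may need a more forgiving parser such as lxml or BeautifulSoup.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContainerTextExtractor(HTMLParser):
    """Collect the text inside the first element matching an id or class.

    Hypothetical helper for illustration; assumes non-void tags are closed.
    """

    def __init__(self, element_id=None, css_class=None):
        super().__init__(convert_charrefs=True)
        self.element_id = element_id
        self.css_class = css_class
        self.depth = 0       # 0 means we are outside the target container
        self.done = False    # set once the container has been closed
        self.chunks = []

    def _matches(self, attrs):
        attrs = dict(attrs)
        if self.element_id is not None and attrs.get("id") == self.element_id:
            return True
        classes = (attrs.get("class") or "").split()
        return self.css_class is not None and self.css_class in classes

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth += 1          # nested element inside the container
        elif self._matches(attrs):
            self.depth = 1           # entered the container

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True     # left the container; ignore the rest

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, element_id=None, css_class=None):
    """Return the container's text fragments joined by single spaces."""
    parser = ContainerTextExtractor(element_id, css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Sample markup invented for this example.
sample = ('<html><body><div id="nav">Home</div>'
          '<div id="article"><h1>Title</h1><p>First <b>para</b>.</p><br></div>'
          '<div>Footer</div></body></html>')
print(extract_text(sample, element_id="article"))
```

The catch is the one the question raises: the ID or class has to be known per site, so this only works when you can inspect each site's markup ahead of time.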
Using HTQL, the query is:
&html_main_text