Setting up a Python screen scraper that can run on Google App Engine

Posted 2024-08-24 13:16:51 · 156 words · 11 views · 0 comments

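I want to set up an automated screen scraper, written in Python, that runs on Google App Engine. It should scrape a website and store specified results in entities in App Engine. I am looking for guidance on how to approach this. I have looked at BeautifulSoup, but I would like to know whether people can recommend anything else that will run on Google App Engine.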

最后的乘客 2024-08-31 13:16:52

I have had good (although slow) results using mechanize and BeautifulSoup. In fact, to save code space on Google App Engine, I use the (old) version of BeautifulSoup included in mechanize.

I have mechanize in a zip file, mechanize.zip. The index of this zip file looks like:

mechanize/
mechanize/__init__.py
mechanize/_auth.py
mechanize/_beautifulsoup.py
mechanize/_clientcookie.py
... etc

Then in my Python code,

import sys
sys.path.insert(0, 'mechanize.zip')

import mechanize
from mechanize._beautifulsoup import BeautifulSoup
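
The zip-on-`sys.path` technique above can be tried without mechanize. The following is a minimal sketch in Python 3 using only the standard library; the package name `tinypkg` and the helper functions are hypothetical stand-ins, not part of mechanize or App Engine. It builds a small archive with the same layout as the listing (package directory at the archive root) and imports a submodule from it.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Hypothetical package name, standing in for mechanize in this sketch.
PACKAGE = "tinypkg"


def build_zip(zip_path):
    """Write an archive with the package directory at its root."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr(PACKAGE + "/__init__.py", "VERSION = '0.1'\n")
        zf.writestr(
            PACKAGE + "/_helper.py",
            "def greet(name):\n    return 'hello ' + name\n",
        )


def import_from_zip(zip_path, module_name):
    """Put the archive on sys.path, then import the module as usual."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    # The archive may have been created after the import caches were built.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "tinypkg.zip")
build_zip(zip_path)

helper = import_from_zip(zip_path, PACKAGE + "._helper")
print(helper.greet("scraper"))
```

Only pure-Python modules can be loaded this way; compiled extension modules cannot be imported from a zip archive.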

煞人兵器 2024-08-31 13:16:52

Another option is lxml, but it uses C code, so it does not work on GAE.

注定孤独终老 2024-08-31 13:16:52

I have had great success using BeautifulSoup to parse HTML. The catch is that parsing HTML is all BeautifulSoup does. I ended up writing all of the HTTP interaction myself with urlfetch.
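
The split described here, one piece that fetches and one that parses, might look roughly like the sketch below. It is not the code from this answer: it is Python 3, the standard library's `html.parser` stands in for BeautifulSoup, and `LinkCollector`, `scrape_links`, and `fake_fetch` are hypothetical names. The fetch step is passed in as a function, so on App Engine it could be a thin wrapper around urlfetch, while here a canned page keeps the example runnable offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def scrape_links(url, fetch):
    """Fetch a page with the supplied callable, then parse it."""
    html = fetch(url)
    collector = LinkCollector(url)
    collector.feed(html)
    collector.close()
    return collector.links


def fake_fetch(url):
    # Stand-in for a real HTTP call; returns a fixed page for any URL.
    return (
        '<html><body><a href="/a">A</a>'
        '<a href="http://other.example/b">B</a>'
        "<a>no href</a></body></html>"
    )


links = scrape_links("http://example.com/index.html", fake_fetch)
print(links)
```

Keeping the fetcher as a parameter means the parsing code can be tested locally and the HTTP layer swapped without touching it.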

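To scrape my target, I need a full-fledged, code-driven browser that can execute the JavaScript on my target site's pages. I think I will have to drop the Python app and move to Java so that I can use HTMLUnit; prototyping is in progress. -mattb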

挽你眉间 2024-08-31 13:16:51

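BeautifulSoup works fine on App Engine (just make sure to use 3.0.8, not the iffy 3.1.0). I think the main alternative is html5lib. I have not tried it on App Engine yet, but I believe it does run there (quite slowly; if that is a problem, I think you will need to stick with BeautifulSoup). For example, this service runs on App Engine and is based on html5lib.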
