Web/screen scraping with Google App Engine - code works in the Python interpreter but not in GAE

Posted on 2024-08-04 00:24:38

I want to do some web scraping with GAE (the Infinite Campus student information portal, FYI). The service requires you to log in before you can reach the site.
I had some code that worked using mechanize in normal Python. When I learned that I couldn't use mechanize on Google App Engine, I switched to urllib2 + ClientForm. I couldn't get it to log in to the server, so after a few hours of fiddling with cookie handling I ran the exact same code in a normal Python interpreter, and it worked. I then found the log file and saw a ton of messages about stripping the 'host' header out of my request. I found the source file on Google Code: the Host header is on an 'untrusted' list and is removed from all requests made by user code.

Apparently GAE strips out the Host header, which I.C. requires to determine which school system to log you in to, which is why it looked like I couldn't log in.

How would I get around this problem? I can't specify anything else in my fake form submission to the target site. And why would this be a "security hole" in the first place?


Comments (1)

却一份温柔 2024-08-11 00:24:38

App Engine does not strip out the Host header: it forces it to an accurate value based on the URI you are requesting. Assuming that URI is absolute, the server isn't even allowed to consider the Host header anyway, per RFC 2616:

  1. If Request-URI is an absoluteURI, the host is part of the Request-URI.
    Any Host header field value in the
    request MUST be ignored.

...so I suspect you're misdiagnosing the cause of your problem. Try directing the request to a "dummy" server that you control (e.g. another very simple App Engine app of yours) so you can look at all the headers and the body of the request as it comes from your GAE app, versus how it comes from your "normal Python interpreter". What do you observe this way?
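
One possible shape for such a "dummy" server is a tiny WSGI app that echoes back whatever it received. This is only a sketch, not something from the original post: the names `echo_app` and `serve` and the port `8080` are made up for illustration, and it is written for a current Python 3 standard library rather than the Python 2 runtime the question was using.

```python
import json
from wsgiref.simple_server import make_server


def echo_app(environ, start_response):
    """Report the method, path, headers and body exactly as the server saw them."""
    # WSGI exposes most request headers as HTTP_* keys in the environ dict.
    headers = {
        key[5:].replace("_", "-").title(): value
        for key, value in environ.items()
        if key.startswith("HTTP_")
    }
    # Content-Type and Content-Length are stored without the HTTP_ prefix.
    for key in ("CONTENT_TYPE", "CONTENT_LENGTH"):
        if environ.get(key):
            headers[key.replace("_", "-").title()] = environ[key]

    # Read only as many bytes as the client declared.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length) if length else b""

    report = {
        "method": environ.get("REQUEST_METHOD"),
        "path": environ.get("PATH_INFO"),
        "query": environ.get("QUERY_STRING"),
        "headers": headers,
        "body": body.decode("utf-8", "replace"),
    }
    payload = json.dumps(report, indent=2, sort_keys=True).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]


def serve(port=8080):
    """Run the echo app locally (hypothetical helper; the port is arbitrary)."""
    make_server("", port, echo_app).serve_forever()
```

Point the same login request at this app once from the GAE app and once from the local interpreter, then diff the two JSON reports. Differences in `Cookie`, `User-Agent` or the form body are more likely culprits than `Host`.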
