How can I scrape this frame?
If you visit this link right now, you will probably get a VBScript error.
On the other hand, if you visit this link first and then the above link (in the same session), the page comes through.
The way this application is set up, the first page is meant to serve as a frame in the second (main) page. If you click around a bit, you'll see how it works.
My question: How do I scrape the first page with Python? I've tried everything I can think of -- urllib, urllib2, mechanize -- and all I get is 500 errors or timeouts.
I suspect the answer lies with mechanize, but my mechanize-fu isn't good enough to crack this. Can anyone help?
Comments (2)
It always comes down to the request/response model. You just have to craft a series of HTTP requests such that you get the desired responses. In this case, you also need the server to treat each request as part of the same session. To do that, you need to figure out how the server is tracking sessions. It could be a number of things: cookies, hidden inputs, form actions, POST data, or query strings. If I had to guess, I'd put my money on a cookie in this case (I haven't checked the links). If that holds true, you need to send the first request, save the cookie you get back, and then send that cookie along with the second request.
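If the cookie guess is right, the two-request sequence might look roughly like the sketch below. It uses Python 3's `urllib.request` and `http.cookiejar` (the successors to `urllib2` and `cookielib`). The function names and both URLs are placeholders of mine, not anything taken from the site, and I haven't run this against the real application.

```python
import http.cookiejar
import urllib.request


def build_session_opener():
    """Return an opener that stores cookies and replays them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_in_session(opener, outer_url, frame_url, timeout=30):
    """Request the outer page first, then the frame page with the same cookies."""
    # First request: the body is discarded; we only want the session cookie
    # (assuming the server tracks the session with a cookie at all).
    with opener.open(outer_url, timeout=timeout) as response:
        response.read()
    # Second request: the cookie jar attaches the saved cookie automatically.
    with opener.open(frame_url, timeout=timeout) as response:
        return response.read()
```

Usage would be something like `opener, jar = build_session_opener()` followed by `fetch_in_session(opener, main_page_url, frame_page_url)`. If the second request still returns a 500, the session is probably tracked some other way, and printing the contents of `jar` after the first request is a quick way to check.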
It could also be that the initial page will have buttons and links that get you to the second page. Those links will have something like
<A href="http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b=">
where a lot of the gobbledygook is generated by the first page. The
"Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b="
part encodes some session information that you must get from the first page. And, of course, you might even need to do both.
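If the session information does live in those links, one way to get at it is to collect every `href` from the first page and decode its query string. A minimal sketch using only the standard library follows; `LinkCollector`, `extract_links`, and `query_params` are names I made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A ...> arrives here as "a".
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in the HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def query_params(url):
    """Decode a URL's query string into a dict of value lists."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)
```

For the link shown above, `query_params` gives `Center`, `LogNumber`, `t`, `l`, and `b` as separate keys, with `t` decoded to `Traffic Hazard` and `b` kept as an empty string. Those values can then be placed into the URL of the next request.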
You might also try BeautifulSoup in addition to Mechanize. I'm not positive, but you should be able to parse the DOM down into the framed page.
I also find Tamper Data to be a rather useful plugin when I'm writing scrapers.
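To illustrate the "parse down into the framed page" idea: the outer page's `<frame>` or `<iframe>` tags carry the URL of the inner page in their `src` attribute, so the first step is to pull those out. BeautifulSoup can do this with a single `find_all` call. The sketch below uses the standard library's `html.parser` instead, only so that it runs without installing anything; `FrameCollector` and `frame_sources` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the src of every <frame> and <iframe>, resolved to absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(urljoin(self.base_url, src))


def frame_sources(html, base_url):
    """Return the absolute URLs of the framed pages referenced in the HTML text."""
    collector = FrameCollector(base_url)
    collector.feed(html)
    return collector.sources
```

Each URL returned is a candidate for the follow-up request, which still has to be made in the same session as the request for the outer page.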