python mechanize follow_link fails
I'm trying to access search results on the NCBI Images search page (http://www.ncbi.nlm.nih.gov/images) in a script. I want to feed it a search term, report on all of the results, and then move on to the next search term. To do this I need to get to results pages after the first page, so I'm trying to use python mechanize to do it:
import mechanize
browser=mechanize.Browser()
page1=browser.open('http://www.ncbi.nlm.nih.gov/images?term=drug')
a=browser.links(text_regex='Next')
nextlink=a.next()
page2=browser.follow_link(nextlink)
This just gives me back the first page of search results again (in variable page2). What am I doing wrong, and how can I get to that second page and beyond?
Unfortunately that page uses Javascript to POST 2459 bytes of form variables to the server, just to navigate to a subsequent page. Here are a few of the variables (I count 38 vars in total):
You'll need to construct a POST request to the server containing some or all of these variables. Luckily, if you get it working for page 2 you can simply increment CurrPage and send another POST to get each subsequent page of results (no need to extract links).
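For illustration, here is a minimal sketch of that single page-2 POST (not part of the original answer). It assumes Python 3's standard urllib, assumes the POST goes back to the search URL itself, and assumes the page-number field is literally named CurrPage; on the live form it may be a longer dotted name, so treat these as placeholders.

import re
import urllib.request

SEARCH_URL = 'http://www.ncbi.nlm.nih.gov/images?term=drug'
# Paste the full set of captured form variables here (url-encoded body).
post_body = 'CurrPage=1&...'

# Change only the page number, then re-send everything the form would POST.
post_body = re.sub(r'(CurrPage=)\d+', r'\g<1>2', post_body)
request = urllib.request.Request(SEARCH_URL, data=post_body.encode('utf-8'))
request.add_header('Content-Type', 'application/x-www-form-urlencoded')
page2_html = urllib.request.urlopen(request).read()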
Update - That site is a total pain-in-the-ass, but here is a POST-based scrape of pages 2-N. Set MAX_PAGE to the highest page number + 1. The script will produce files like file_000003.html. Note: before you use it, you need to replace POSTDATA with the contents of this paste blob (it expires in 1 month). It's just the body of a POST request as captured by Firebug, which I use to seed the correct params:
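The script and the paste blob themselves are not reproduced here, so the following is only a rough reconstruction of what the answer describes, under the same assumptions as above (Python 3's urllib, a field named CurrPage, and the search URL as the POST target); it is not the author's original code.

import re
import urllib.request

SEARCH_URL = 'http://www.ncbi.nlm.nih.gov/images?term=drug'
MAX_PAGE = 10  # set to the highest page number + 1
POSTDATA = 'CurrPage=1&...'  # replace with the captured POST body (e.g. from Firebug)

for page in range(2, MAX_PAGE):
    # Re-seed the captured form variables, changing only the page number.
    body = re.sub(r'(CurrPage=)\d+', r'\g<1>%d' % page, POSTDATA)
    request = urllib.request.Request(SEARCH_URL, data=body.encode('utf-8'))
    request.add_header('Content-Type', 'application/x-www-form-urlencoded')
    html = urllib.request.urlopen(request).read()
    # Write each results page to a numbered file, e.g. file_000003.html.
    with open('file_%06d.html' % page, 'wb') as out:
        out.write(html)

Re-sending the whole captured body and rewriting only the page number avoids having to reconstruct all 38 form variables by hand.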