Using Wget to download PDF files from a site that requires cookies to be set
I want to access a newspaper site and download their e-paper copies (in PDF). The site requires me to log in with my email address and password, and then lets me access those PDF URLs.
I'm having trouble 'setting my session' in wget. When I log in to the site from my browser, it sets two cookie values:
[email protected]
Password=12345
I tried:
wget --post-data "[email protected]&Password=12345" http://epaper.abc.com/login.aspx
However, that just downloaded the login page and saved it locally.
The form on the login page has two fields:
txtUserID
txtPassword
and radio buttons like this:
<input id="rbtnManchester" type="radio" checked="checked" name="txtpub" value="44">
Another button:
<input id="rbtnLondon" type="radio" name="txtpub" value="64">
If I post these fields to the login.aspx page, I get the same output:
wget --post-data "[email protected]&txtPassword=12345&txtpub=44" http://epaper.abc.com/login.aspx
If I do:
--save-cookies abc_cookies.txt
the cookie file doesn't seem to contain anything other than the default header.
Finally, if I also pass --debug, the output includes:
...
Set-Cookie: ASP.NET_SessionId=05kphcn4hjmblq45qgnjoe41; path=/; HttpOnly
...
Stored cookie epaper.abc.com -1 (ANY) / <session> <insecure> [expiry none] ASP.NET_SessionId 05kphcn4hjmblq45qgnjoe41
Length: 107253 (105K) [text/html]
Saving to: `login.aspx'
...
Saving cookies to abc_cookies.txt.
However, abc_cookies.txt shows ONLY the following:
# HTTP cookie file.
# Generated by Wget on 2011-08-16 08:03:05.
# Edit at your own risk.
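One likely reason the cookie file stays empty: the debug output marks ASP.NET_SessionId as a `<session>` cookie with `[expiry none]`, and Wget's --save-cookies skips session cookies unless --keep-session-cookies is also given. The sketch below shows that flag in use. The login URL and field names are taken from the question; the helper names (`login`, `fetch_pdf`, `count_cookies`) and any PDF URL passed to `fetch_pdf` are made up for illustration. It is also only an assumption that the form needs nothing beyond these three fields: ASP.NET pages often expect hidden fields such as __VIEWSTATE to be posted back as well, and the field values may need URL-encoding.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; URL and field names come from the question.
COOKIES=abc_cookies.txt

# Log in and keep the session cookie on disk.
# --keep-session-cookies makes --save-cookies write session cookies too.
login() {
    wget --save-cookies "$COOKIES" --keep-session-cookies \
         --post-data "txtUserID=$1&txtPassword=$2&txtpub=44" \
         -O /dev/null \
         http://epaper.abc.com/login.aspx
}

# Reuse the saved cookies for a later request ($1 is the PDF URL).
fetch_pdf() {
    wget --load-cookies "$COOKIES" "$1"
}

# Count real entries in a cookie file (non-blank lines that are not comments).
# A result of 0 means only the default header was written.
count_cookies() {
    grep -c -v -e '^#' -e '^[[:space:]]*$' "$1" || true
}
```

After calling `login` with the (URL-encoded) email address and password, `count_cookies "$COOKIES"` should report at least one entry if the session cookie was saved; `fetch_pdf` can then be pointed at one of the PDF URLs.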
Comments (1)
Just a suggestion: did you try passing the values as query string variables (not too secure, obviously)?
You might have to escape the special characters depending on your shell / OS.
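To illustrate the escaping point, here is a minimal sketch of percent-encoding values before putting them into a query string. The `urlencode` helper and the sample credentials are hypothetical, it handles ASCII input only, and whether login.aspx accepts these fields via the query string at all is untested.

```shell
#!/bin/sh
# Percent-encode every character except the RFC 3986 unreserved set.
# Hypothetical helper; ASCII input only.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;   # "'c" yields the character code
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# Sample values, made up for illustration.
user=$(urlencode 'someone@example.com')
pass=$(urlencode '12345')

# Single quotes or double quotes around the URL keep the shell from
# treating '&' as a command separator.
url="http://epaper.abc.com/login.aspx?txtUserID=$user&txtPassword=$pass&txtpub=44"
echo "$url"
```

The resulting URL can then be passed to wget in double quotes, for example `wget "$url"`.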