Reading a JSP servlet page using PHP (libcurl), Python (urllib), or AJAX
I'm trying to fetch this page with curl: http://rutgers.bncollege.com/webapp/wcs/stores/servlet/TextBookProcessDropdownsCmd?campusId=35577418
in order to extract some data. The problem is that I keep getting either a 404 error or a 302 status in the response. I suspect it has something to do with Barnes and Noble's Tomcat not properly redirecting to the servlet when the request comes from a remote client, but that's just speculation. I have tried several implementations: libcurl in PHP5, urllib in Python, AJAX (with and without a framework), and the curl binary from my terminal.
Here's an example of the output I receive when I echo the response text:
An error has occurred:
Error Code: 404
Message Target: /BNCB_GenericError.jsp
Servlet Name: JSP 1.2 Processor
Stack Trace: [Ljava.lang.StackTraceElement;@14b6c4d
Root Cause: N/A
Here are the headers I'm sending and receiving:
Response Headers
Expires Thu, 01 Dec 1994 16:00:00 GMT
Cache-Control no-cache="set-cookie,set-cookie2"
Content-Length 0
PerfHeader duration=D=56606, time=t=1300287776952692
Content-Type text/html; charset=ISO-8859-1
Content-Language en-US
Date Wed, 16 Mar 2011 15:02:57 GMT
Connection keep-alive
Vary Accept-Encoding
Set-Cookie WC_SESSION_ESTABLISHED=true;Domain=.bncollege.com;Path=/
WC_ACTIVESTOREDATA=%2d1%2c0;Domain=.bncollege.com;Path=/
WC_USERSESSION_46349649=46349649%2cnull%2cnull%2c%2d2000%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2c%5b0%7cnull%7cnull%7cnull%7c%2d2000%5d%2c8XwO3l7WhszbuSO41vmZUDtbpoQ%3d;Domain=.bncollege.com;Path=/
JSESSIONID=0000AuZi2Uo6F6Ft5xihFdUsBQn:app06z02;Domain=.bncollege.com;Path=/
TS884e96=b7fb55c6fcd8aff3987bcdb831a8255a16b4cbcb208252614d80d120;
Request Headers
Host uncc.bncollege.com
User-Agent Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.15pre) Gecko/20110227 Firefox/3.6.15pre (Mac Community Build, ElFurbe)
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-us,en;q=0.5
Accept-Encoding gzip,deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive 115
Connection keep-alive
Referer http://localhost/bn.php
Origin http://localhost
And here's the code for that:
function bufferURL($url,$bindArgs) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://rutgers.bncollege.com/webapp/wcs/stores/servlet/TBWizardView?catalogId=10001&storeId=58552&langId=-1');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_COOKIEJAR, "my_cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIEFILE, "my_cookies.txt");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
    curl_exec($ch);
    $url .= '?';
    foreach ($bindArgs as $a => $b) $url .= $a . '=' . $b . '&';
    $url = substr($url,0,strlen($url)-1);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    echo curl_exec($ch);
}
BN appears to be using Dojo to perform their AJAX queries to the servlet; however, even when I use the same request format, I am unable to replicate their requests.
Comments (1)
This looks kind of fishy. The very first option you set is a URL. Then you set it to POST data. Then at the end of the option block, you append a question mark to it, manually append even more query strings, then yank off the extra ampersand... and then re-reset the URL after changing it to a GET request! This makes no sense.
The addition of the question mark is surely at least one of the things that are breaking it -- there's already one in the URL, signifying that it also already has a query string. Further, you don't seem to be properly URL-encoding any of your additional parameters, so those might also be causing trouble.
Take a look at parse_url and parse_str, two tools you can use to disassemble the original URL and retrieve the original query string. Once you've broken it into components and have extracted the original query string into an array, you can add your new query options to that array / replace existing entries / remove ones you don't want. You can then use http_build_query to rebuild the query string, and then reassemble the proper URL. However, that will only really work well if the target accepts this data as a GET instead of as a POST.
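To make the parse_url / parse_str / http_build_query approach concrete, here is a minimal sketch. The helper name addQueryParams is made up for illustration, and it assumes an absolute http(s) URL; it drops any fragment or credentials in the URL for brevity. The example parameter is the campusId value from the question.

```php
<?php
// Hypothetical helper (not from the original post): merge extra parameters
// into a URL that may already carry a query string.
function addQueryParams(string $url, array $extra): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    // Decode the existing query string (if any) into an array.
    $query = [];
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }

    // New values replace existing keys; everything else is kept.
    $query = array_replace($query, $extra);

    // Reassemble scheme, host, optional port, and path.
    $rebuilt = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $rebuilt .= ':' . $parts['port'];
    }
    $rebuilt .= $parts['path'] ?? '/';

    // http_build_query URL-encodes names and values and adds no stray '&'.
    if ($query !== []) {
        $rebuilt .= '?' . http_build_query($query);
    }
    return $rebuilt;
}

$base = 'http://rutgers.bncollege.com/webapp/wcs/stores/servlet/TBWizardView'
      . '?catalogId=10001&storeId=58552&langId=-1';
echo addQueryParams($base, ['campusId' => '35577418']), "\n";
```

Because the existing query string is parsed first, there is never a second question mark, and http_build_query takes care of the encoding that the manual concatenation loop skipped.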
If the target requires a POST, you can probably leave the query string alone, and simply submit your query strings as an array, using the CURLOPT_POSTFIELDS option. It will do all the hard work for you.
Be advised, it kind of looks like you're trying to use someone's ajax endpoint for unexpected purposes. It is possible that there are additional protections that they have put in place to deter automated requests.
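For the POST case, a rough sketch might look like the following. The function names buildPostOptions and postForm are hypothetical, and whether the servlet accepts a POST at all is an assumption; the cookie file name is the one from the question. One detail worth knowing: passing an array to CURLOPT_POSTFIELDS sends the body as multipart/form-data, while passing a string from http_build_query sends application/x-www-form-urlencoded, which is what a browser form or a typical AJAX call usually sends. The sketch uses the string form for that reason.

```php
<?php
// Hypothetical helper: build the cURL options for a form-style POST.
// Requires the curl extension for the CURLOPT_* constants.
function buildPostOptions(array $fields, string $cookieFile): array
{
    return [
        CURLOPT_POST           => true,
        // A string body is sent as application/x-www-form-urlencoded;
        // an array here would be sent as multipart/form-data instead.
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Read and write cookies in the same file so the session persists.
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_COOKIEFILE     => $cookieFile,
    ];
}

// Hypothetical helper: send the POST and return the response body.
function postForm(string $url, array $fields, string $cookieFile = 'my_cookies.txt'): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, buildPostOptions($fields, $cookieFile));

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    // Surface HTTP-level failures such as the 404 seen in the question.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("Server responded with HTTP $status");
    }
    return $body;
}
```

A call would then be something like postForm($endpointUrl, ['campusId' => '35577418']), where $endpointUrl is the servlet URL you are targeting. If the server still answers with an error page, that points toward the server-side protections mentioned above rather than the request construction.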