Reading a JSP servlet page using PHP (libcurl), Python (urllib), or AJAX

Posted 2024-10-22 00:35:35

So, I'm attempting to curl a page: http://rutgers.bncollege.com/webapp/wcs/stores/servlet/TextBookProcessDropdownsCmd?campusId=35577418 in order to extract some data. The issue is that I keep getting either a 404 error or a 302 status in the response headers. I suspect it has something to do with Barnes and Noble's Tomcat not properly redirecting to the servlet when requested remotely. That's just speculation though. I have tried multiple implementations using libcurl in PHP5, urllib in Python, AJAX (framework and non-framework), and the curl binary from my terminal.

Here's an example of the output I receive when I echo the response text:

An error has occurred:
Error Code: 404
Message Target: /BNCB_GenericError.jsp
Servlet Name: JSP 1.2 Processor
Stack Trace: [Ljava.lang.StackTraceElement;@14b6c4d
Root Cause: N/A

Here are the headers I'm sending and receiving:

Response Headers

Expires: Thu, 01 Dec 1994 16:00:00 GMT
Cache-Control: no-cache="set-cookie,set-cookie2"
Location: http://uncc.bncollege.com/webapp/wcs/stores/servlet/TBDropDownView?campusId=1748054&dojo.transport=xmlhttp&dojo.preventCache=1300287790307&ddkey=TextBookProcessDropdownsCmd
Content-Length: 0
PerfHeader: duration=D=56606, time=t=1300287776952692
Content-Type: text/html; charset=ISO-8859-1
Content-Language: en-US
Date: Wed, 16 Mar 2011 15:02:57 GMT
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: WC_SESSION_ESTABLISHED=true;Domain=.bncollege.com;Path=/
Set-Cookie: WC_ACTIVESTOREDATA=%2d1%2c0;Domain=.bncollege.com;Path=/
Set-Cookie: WC_USERSESSION_46349649=46349649%2cnull%2cnull%2c%2d2000%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2c%5b0%7cnull%7cnull%7cnull%7c%2d2000%5d%2c8XwO3l7WhszbuSO41vmZUDtbpoQ%3d;Domain=.bncollege.com;Path=/
Set-Cookie: JSESSIONID=0000AuZi2Uo6F6Ft5xihFdUsBQn:app06z02;Domain=.bncollege.com;Path=/
Set-Cookie: TS884e96=b7fb55c6fcd8aff3987bcdb831a8255a16b4cbcb208252614d80d120;

Request Headers

Host: uncc.bncollege.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.15pre) Gecko/20110227 Firefox/3.6.15pre (Mac Community Build, ElFurbe)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://localhost/bn.php
Origin: http://localhost

And here's the code for that:

function bufferURL($url,$bindArgs) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://rutgers.bncollege.com/webapp/wcs/stores/servlet/TBWizardView?catalogId=10001&storeId=58552&langId=-1');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_COOKIEJAR, "my_cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIEFILE, "my_cookies.txt");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
    curl_exec($ch);

    $url .= '?';
    foreach ($bindArgs as $a => $b) $url .= $a . '=' . $b . '&';
    $url = substr($url,0,strlen($url)-1);

    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    echo curl_exec($ch);
}

BN appears to be using Dojo to perform their AJAX queries to the servlet; however, even when using the same request format, I am unable to replicate.

再可℃爱ぅ一点好了 2024-10-29 00:35:35

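This looks kind of fishy. The very first option you set is the URL. Then you set the request to POST. Then, at the end of the options block, you add a question mark, manually tack on more query string, and snip off the extra ampersand... and then reset the URL again after switching to a GET request! This makes no sense.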

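Adding the question mark is surely at least one of the things breaking it: the URL already has a question mark in it, which shows that it already has a query string. Also, you don't appear to be properly URL-encoding any of the additional parameters, so those may be causing problems as well.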

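Take a look at parse_url and parse_str, two tools you can use to take apart the original URL and retrieve the original query string. Once you've broken it into components and extracted the original query string into an array, you can add your new query options to that array, replace existing entries, or remove unwanted ones. You can then use http_build_query to rebuild the query string and reassemble a proper URL.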
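To make the parse_url / parse_str / http_build_query suggestion concrete, here is a minimal sketch. The helper name addQueryParams and the extra parameters termId and deptName are made up for illustration and are not part of the original post or of the site's API. It also drops any URL fragment, userinfo, and similar components for brevity. One caveat worth knowing: parse_str converts dots and spaces in parameter names to underscores, so names like dojo.transport would not survive the round trip unchanged.

```php
<?php
// Hypothetical helper (not from the original post): merge extra parameters
// into a URL that may already carry a query string.
function addQueryParams(string $url, array $extra): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Cannot parse URL: $url");
    }

    // Pull the existing query string into an array.
    // Caveat: parse_str turns dots and spaces in names into underscores.
    $query = [];
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }

    // New values override existing entries with the same name.
    $query = array_replace($query, $extra);

    // Reassemble the URL; http_build_query URL-encodes names and values.
    $rebuilt = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $rebuilt .= ':' . $parts['port'];
    }
    $rebuilt .= $parts['path'] ?? '/';
    if ($query !== []) {
        $rebuilt .= '?' . http_build_query($query);
    }
    return $rebuilt;
}

// termId and deptName are illustrative parameter names only.
echo addQueryParams(
    'http://rutgers.bncollege.com/webapp/wcs/stores/servlet/TextBookProcessDropdownsCmd?campusId=35577418',
    ['termId' => '123', 'deptName' => 'A & B']
), "\n";
```

Because the query string is rebuilt from an array, there is never a second question mark or a trailing ampersand to trim, and values such as "A & B" are encoded rather than splitting the query.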

However, this only really works if the target expects GET data rather than POST data.

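If the target expects a POST, you can leave the query string alone and use the CURLOPT_POSTFIELDS option to submit the query data as an array. It will do all the hard work for you.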
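A rough sketch of the POST variant follows, assuming the cURL extension is loaded. The helper name postForm is hypothetical. One detail to be aware of: when CURLOPT_POSTFIELDS is given an array, PHP sends the body as multipart/form-data, whereas a URL-encoded string (for example from http_build_query) is sent as application/x-www-form-urlencoded. Which one this particular servlet expects is not known from the post, so the sketch uses the string form as an assumption.

```php
<?php
// Hypothetical helper (not from the original post): POST form fields with
// cURL and return the response body, or throw on a transport error.
function postForm(string $url, array $fields, string $cookieFile): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        // A string body is sent as application/x-www-form-urlencoded;
        // passing $fields directly would be sent as multipart/form-data.
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow 302 redirects
        CURLOPT_COOKIEJAR      => $cookieFile,  // write session cookies here
        CURLOPT_COOKIEFILE     => $cookieFile,  // and send them back on later calls
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}
```

It could be called as postForm($url, ['campusId' => '35577418'], 'my_cookies.txt'), reusing the cookie file name from the question. Whether the endpoint accepts campusId in a POST body is an assumption.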

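Note that it looks like you're trying to use someone's AJAX endpoint for unintended purposes. They may have additional protections in place to block automated requests.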