Reading a JSP servlet page with PHP (libcurl), Python (urllib), or AJAX

Posted 2024-10-22 00:35:35

So, I'm attempting to curl a page, http://rutgers.bncollege.com/webapp/wcs/stores/servlet/TextBookProcessDropdownsCmd?campusId=35577418, in order to extract some data. The issue is that I keep getting either a 404 error or a 302 status in the response headers. I suspect it has something to do with Barnes and Noble's Tomcat not properly redirecting to the servlet when requested remotely, but that's just speculation. I have tried multiple implementations: libcurl in PHP5, urllib in Python, AJAX (framework and non-framework), and the curl binary from my terminal.

Here's an example of the output I receive when I echo the response text:

An error has occurred:

Error Code: 404

Message Target: /BNCB_GenericError.jsp

Servlet Name: JSP 1.2 Processor

Stack Trace: [Ljava.lang.StackTraceElement;@14b6c4d

Root Cause: N/A

Here are the headers I'm sending and receiving:

Response Headers

Expires Thu, 01 Dec 1994 16:00:00 GMT
Cache-Control no-cache="set-cookie,set-cookie2"
Location http://uncc.bncollege.com/webapp/wcs/stores/servlet/TBDropDownView?campusId=1748054&dojo.transport=xmlhttp&dojo.preventCache=1300287790307&ddkey=TextBookProcessDropdownsCmd
Content-Length 0
PerfHeader duration=D=56606, time=t=1300287776952692
Content-Type text/html; charset=ISO-8859-1
Content-Language en-US
Date Wed, 16 Mar 2011 15:02:57 GMT
Connection keep-alive
Vary Accept-Encoding
Set-Cookie WC_SESSION_ESTABLISHED=true;Domain=.bncollege.com;Path=/
WC_ACTIVESTOREDATA=%2d1%2c0;Domain=.bncollege.com;Path=/
WC_USERSESSION_46349649=46349649%2cnull%2cnull%2c%2d2000%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2c%5b0%7cnull%7cnull%7cnull%7c%2d2000%5d%2c8XwO3l7WhszbuSO41vmZUDtbpoQ%3d;Domain=.bncollege.com;Path=/
JSESSIONID=0000AuZi2Uo6F6Ft5xihFdUsBQn:app06z02;Domain=.bncollege.com;Path=/
TS884e96=b7fb55c6fcd8aff3987bcdb831a8255a16b4cbcb208252614d80d120;

Request Headers

Host uncc.bncollege.com
User-Agent Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.15pre) Gecko/20110227 Firefox/3.6.15pre (Mac Community Build, ElFurbe)
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-us,en;q=0.5
Accept-Encoding gzip,deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive 115
Connection keep-alive
Referer http://localhost/bn.php
Origin http://localhost

And here's the code for that:

function bufferURL($url,$bindArgs) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://rutgers.bncollege.com/webapp/wcs/stores/servlet/TBWizardView?catalogId=10001&storeId=58552&langId=-1');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_COOKIEJAR, "my_cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIEFILE, "my_cookies.txt");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3");
    curl_exec($ch);

    $url .= '?';
    foreach ($bindArgs as $a => $b) $url .= $a . '=' . $b . '&';
    $url = substr($url,0,strlen($url)-1);

    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    echo curl_exec($ch);
}

BN appears to be using Dojo to perform their AJAX queries to the servlet; however, even when using the same request format, I am unable to replicate.

Comments (1)

再可℃爱ぅ一点好了 2024-10-29 00:35:35

This looks kind of fishy. The very first option you set is a URL. Then you switch the request to POST. Then, after the option block, you append a question mark to the URL, manually append even more query parameters, yank off the extra ampersand... and then reset the URL yet again after changing the request to a GET! This makes no sense.

The addition of the question mark is surely at least one of the things that are breaking it -- there's already one in the URL, signifying that it also already has a query string. Further, you don't seem to be properly URL-encoding any of your additional parameters, so those might also be causing trouble.

Take a look at parse_url and parse_str, two tools you can use to disassemble the original URL and retrieve the original query string. Once you've broken it into components and have extracted the original query string into an array, you can add your new query options to that array / replace existing entries / remove ones you don't want. You can then use http_build_query to rebuild the query string, and then reassemble the proper URL.
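As a rough illustration of that approach, here is a minimal sketch, assuming the extra parameters arrive as a plain associative array. The function name addQueryParams and the deptId parameter are made up for the example and do not come from the BN site. One caveat worth knowing: parse_str rewrites dots and spaces in parameter names to underscores, so a name such as dojo.transport would not survive the round trip unchanged.

```php
<?php
// Hypothetical helper: merge extra parameters into a URL that may
// already carry a query string, without ever doubling the "?".
function addQueryParams(string $url, array $extra): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    // Decode the existing query string (if any) into an array.
    // Note: parse_str turns "." and " " in names into "_".
    $query = [];
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }

    // New values override existing keys; other entries are kept as-is.
    $query = array_replace($query, $extra);

    // Reassemble scheme, host, optional port, and path.
    $rebuilt = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $rebuilt .= ':' . $parts['port'];
    }
    $rebuilt .= $parts['path'] ?? '/';

    // http_build_query URL-encodes every name and value.
    if ($query !== []) {
        $rebuilt .= '?' . http_build_query($query, '', '&');
    }
    return $rebuilt;
}

// "deptId" is an invented parameter, used only to show the encoding.
echo addQueryParams(
    'http://rutgers.bncollege.com/webapp/wcs/stores/servlet/TextBookProcessDropdownsCmd?campusId=35577418',
    ['deptId' => 'A B&C']
), "\n";
```

The resulting string can be handed to CURLOPT_URL directly; the space and the ampersand in the sample value come out encoded rather than breaking the query string.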

However, that will only really work well if the target accepts this data as a GET instead of as a POST.

If the target requires a POST, you can probably leave the query string alone, and simply submit your query strings as an array, using the CURLOPT_POSTFIELDS option. It will do all the hard work for you.
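A POST variant might look like the sketch below. It is only an outline under assumptions: the helper names buildPostOptions and postForm are invented, and whether the servlet accepts these fields as POST data at all is not known. The sketch encodes the fields with http_build_query so the body goes out as application/x-www-form-urlencoded; handing the raw array to CURLOPT_POSTFIELDS also works, but cURL then sends multipart/form-data instead.

```php
<?php
// Hypothetical helper: build the cURL options for a form POST.
// Kept separate from the request so the options can be inspected.
function buildPostOptions(string $url, array $fields): array
{
    return [
        CURLOPT_URL            => $url,   // existing query string left alone
        CURLOPT_POST           => true,
        // A string body is sent as application/x-www-form-urlencoded;
        // passing $fields as an array would send multipart/form-data.
        CURLOPT_POSTFIELDS     => http_build_query($fields, '', '&'),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => 'my_cookies.txt',
        CURLOPT_COOKIEFILE     => 'my_cookies.txt',
    ];
}

// Hypothetical helper: send the POST and return the response body.
function postForm(string $url, array $fields): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildPostOptions($url, $fields));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Example call (commented out so the sketch does not hit the network):
// echo postForm(
//     'http://rutgers.bncollege.com/webapp/wcs/stores/servlet/TextBookProcessDropdownsCmd',
//     ['campusId' => '35577418']
// );
```

Setting every option in one place also avoids the situation in the question, where the same handle is switched from POST to GET and has its URL replaced halfway through.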

Be advised, it kind of looks like you're trying to use someone's AJAX endpoint for unexpected purposes. They may have put additional protections in place to deter automated requests.
