How do I make LWP::UserAgent look like another browser?
This is my first post on SO, so be gentle. I'm not even sure if this belongs here, but here goes.
I want to access some information on one of my personal accounts. The website is poorly written and requires me to manually input the date I want the information for. It is truly a pain. I have been looking for an excuse to learn more Perl, so I thought this would be a great opportunity. My plan was to write a Perl script that would log in to my account and query the information for me. However, I got stuck pretty quickly.
my $ua = LWP::UserAgent->new;
my $url = url 'https://account.web.site';
my $res = $ua->request(GET $url);
The resulting web page basically says that my web browser is not supported. I tried a number of different values for
$ua->agent("");
but nothing seems to work. Googling around suggests this method, but it also says that Perl is used for malicious purposes on web sites. Do web sites block this method? Is what I am trying to do even possible? Is there a different language that would be more appropriate? Is what I'm trying to do even legal, or even a good idea? Maybe I should just abandon my efforts.
Note that to prevent giving away any private information, the code I wrote here is not the exact code I am using. I hope that was pretty obvious, though.
EDIT: In Firefox, I disabled JavaScript and CSS. I logged in just fine without the "Incompatible browser" error. It doesn't seem to be a JavaScript issue.
Getting a different webpage with scraping
We have to make one assumption: the web server will return the same output if given the same input. With this assumption, we inescapably come to the conclusion that we're not giving it the same input. There are two browsers, or HTTP clients, in this scenario: the one that is giving you the result you want (e.g., Firefox, IE, Chrome, or Safari), and the one that is not (e.g., LWP, wget, or cURL).
Kill off the easy possibilities first
Before continuing, first make sure the User-Agent strings are the same. You can do this by browsing to whatsmyuseragent.com and setting the User-Agent string in the header of the other browser to whatever that website returns. You can also use Firefox's Web Developer Toolbar to disable CSS, JavaScript, Java, and meta-redirects: this will help you track down the problem by killing off the really simple stuff.
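As a minimal sketch of that first step in LWP: the User-Agent string below is only a placeholder, so paste in whatever your own working browser reports.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Placeholder: replace with the exact string your working browser reports.
my $browser_ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0';

my $ua = LWP::UserAgent->new;
$ua->agent($browser_ua);

# Confirm what LWP will now send as its User-Agent header.
print "User-Agent: ", $ua->agent, "\n";
```

If the site still rejects the script after this, the User-Agent alone is not what it is checking.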
Now attempt to duplicate the working browser
Now with Firefox you can use Firebug to analyze the REQUEST that is sent. You can do this under the NET tab in Firebug. Other browsers should have tools that can do what Firebug does for Firefox; however, if you don't know the tool in question, you can still use tshark or wireshark as described below. It is important to note that tshark and wireshark will always be more accurate because they work at a lower level, which, at least in my experience, leaves less room for error. For example, you'll see things like meta-redirects the browser is doing, which Firebug can sometimes lose track of.
After you understand the first web request that works, do your best to make the second web request match the first. By this I mean setting the request headers and other request elements properly. If this still doesn't work, you have to know what the second browser is doing to see what is wrong.
Troubleshooting
In order to troubleshoot this, we must have a total understanding of the requests from both browsers. The second browser is usually trickier: these are often libraries and non-interactive command-line browsers that lack the ability to show the request they send. Even if they have the ability to dump the request, you might still opt to check it on the wire anyway. To do this I suggest the wireshark and tshark suite. Be warned that because these operate below the browser, by default you'll see the actual network (IP) packets and data-link frames. You can filter out what you need specifically with a command like this.
This will capture all of the TCP packets, display-filter only the http.request packets, then use Perl to filter for only the HTTP-layer content. You might want to add to the display filter to only grab a single web server too: -R "http.request and http.host == ''"
You're going to want to check everything to see whether the two requests are in line: cookies, GET URL, User-Agent, etc. Make sure the site doesn't do something goofy.
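Once you have the browser's headers written down, the comparison itself can be done in Perl. This is only a rough sketch: the %browser values are placeholders standing in for what you captured, and the URL is the dummy one from the question.

```perl
use strict;
use warnings;
use HTTP::Request;
use LWP::UserAgent;

# Headers copied from the working browser's request (placeholder values).
my %browser = (
    'User-Agent'      => 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
    'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' => 'en-US,en;q=0.5',
);

# The request the script is about to send.
my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'https://account.web.site/');

# LWP fills in its User-Agent at send time; copy it in so the comparison sees it.
$req->header('User-Agent' => $ua->agent);

# Collect every browser header that our request lacks or sets differently.
my @diffs;
for my $name (sort keys %browser) {
    my $ours = $req->header($name);
    push @diffs, $name if !defined $ours || $ours ne $browser{$name};
}
print "Differs from browser: $_\n" for @diffs;
```

With a stock LWP::UserAgent every header in the list is reported, which is usually a good hint about where to start.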
Updated Jan 23 2010: Based on the new information, I would suggest setting Accept, Accept-Language, Accept-Charset and Accept-Encoding. You can do that through $ua->default_headers(). If what you need is a lot more functionality out of your user agent, you can always subclass it. I took this approach for my GData API; you can find my example of a UserAgent subclass on GitHub.
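A minimal sketch of setting those four headers as defaults, so that every request from the agent carries them. The values are placeholders (copy the real ones from your browser), and gzip is assumed to be decodable by your Perl installation.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Placeholder values: copy the real ones from your working browser's request.
$ua->default_header('Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8');
$ua->default_header('Accept-Language' => 'en-US,en;q=0.5');
$ua->default_header('Accept-Charset'  => 'utf-8');

# Only advertise encodings LWP can decode (assumed here: gzip);
# read such bodies with $res->decoded_content rather than $res->content.
$ua->default_header('Accept-Encoding' => 'gzip');

# default_headers returns the HTTP::Headers object holding these defaults.
print $ua->default_headers->as_string;
```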
You should probably look at WWW::Mechanize, which is a subclass of LWP::UserAgent that is oriented towards that sort of website automation. In particular, see the agent_alias method.
Some websites do block connections based on the User-Agent, but you can set that to whatever you want using Perl. It's possible that a website might also look for other request headers normally generated by a particular browser (like the Accept header) and refuse connections that don't include them, but you can add those headers too, if you figure out what it's looking for.
In general, it's impossible for a website to prevent a different client from impersonating a supported browser. No matter what it's looking for, you can eventually duplicate it.
It's also possible that it's looking for JavaScript support. In that case, you might look at WWW::Scripter, which is a subclass of WWW::Mechanize that adds JavaScript support. It's fairly new and I haven't tried it yet.
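A short sketch of agent_alias, assuming WWW::Mechanize is installed. The alias names differ between versions, so this asks the module which ones it knows instead of hard-coding one.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 so failed requests return a response instead of dying.
my $mech = WWW::Mechanize->new(autocheck => 0);

# known_agent_aliases lists the alias names this WWW::Mechanize version ships with.
my @aliases = $mech->known_agent_aliases;
print "Available aliases:\n", map { "  $_\n" } @aliases;

# Pick one; agent_alias swaps in the full User-Agent string behind the alias.
$mech->agent_alias($aliases[0]);
print "Now sending: ", $mech->agent, "\n";
```

Since WWW::Mechanize is a subclass of LWP::UserAgent, everything else in this thread (default headers, cookie jars, handlers) applies to $mech as well.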
This thread is almost certainly not about merely changing the User-Agent.
I see two paths. Either we can experiment with turning off JavaScript and CSS in the browser, and learn more about getting into the HTTP::Request and HTTP::Response objects while relying on LWP::UserAgent, or go to WWW::Scripter and use JavaScript.
Even in crude Craigslist text ads, there are three pages of densely packed, almost space-free JavaScript and CSS, and then they load more specialized code, so that if I come in via Comcast I find special JavaScript, targeting only Comcast users, loaded into the final page. They do that as part of their attempt to break robots: code in the HEAD lawyers the difference between HTML 1.0 and 1.1 to say, oh, there is something a little bit wrong, you need an HTTP refresh, and then hits you with extra code to snoop out your ISP and who knows what else, cookie info for sure. (You can print out cookies at every turn once you learn how to slow LWP down and insert callback code to snoop like *shark but inside Perl. You can also see how the server keeps trying to change "your" headers and re-negotiate "your" request: oh, you don't want to buy a cheap car, you want to buy a Maserati and mortgage your house to do it, i.e. snoop your ISP, and why not your contacts and all your Google history! Who knows?!)
CL puts a random ID number into Alice's HEAD, then whispers that you need an HTTP request to swallow the red pill, stop hiding it under your tongue. That way most robots choke and accept a fake sanitized page, i.e. a truncated "home page". Also, if I scrape URLs from the page, I can't "click" on them using LWP because I never learned my ID, nor did I learn the JavaScript to parrot the ID back before a $ua->get("$url&ID=9dd887f8f89d9"); or maybe the simple get would work with &ID. It's way more than User-Agent, but you can do it and you're getting all the help you need from
As you can see, the first path is to turn all that off and see if you can learn your re-negotiated request's URI: not the original URL, but the URI. Then get it, no JavaScript, no WWW::Scripter. It sounds like LWP will work for you. I would like to hear more about changing the Accept headers in default_header initially, and whether the server says, oh, you mean accept this and this and this, swallow the red pill in the re-negotiated Request object. You can snoop that by inserting callbacks in the request and response conversation.
The second path, WWW::Scripter, is only for if we decide to swallow the Red Pill and go down Alice's Rabbit Hole, aka the Matrix. Perl philosophy dictates exhausting other possibilities before working harder. Otherwise we wouldn't have learned our HTTP 101 prereqs, so escalating to a bigger hammer would be just that, or dropping acid for aspirin, or not?
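The callback snooping mentioned in this answer can be sketched with LWP's handler hooks. This is an illustration under assumptions, not a finished tool: the URL in the last line is the placeholder from the question, and the actual request is left commented out.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new);    # in-memory jar so cookies persist across hops

# Called just before each request goes out, including ones that follow redirects.
$ua->add_handler(request_send => sub {
    my ($request, $ua, $handler) = @_;
    print ">>> ", $request->method, " ", $request->uri, "\n";
    print $request->headers->as_string, "\n";
    return;    # returning nothing lets LWP carry on and send the request
});

# Called once each response has been fully received.
$ua->add_handler(response_done => sub {
    my ($response, $ua, $handler) = @_;
    print "<<< ", $response->status_line, "\n";
    print $response->headers->as_string, "\n";
    return;
});

# my $res = $ua->get('https://account.web.site/');    # placeholder URL
```

Every hop of a redirect chain passes through both handlers, so you can watch Set-Cookie and Location headers arrive and see what the next request sends back.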
Well, would you like to tell us what those things you tried were?
What I normally do is type javascript:prompt('your agent string is',navigator.userAgent) into my regular browser's URL bar, hit enter, and cut and paste what it tells me. Surely using wireshark and monitoring actual packets is overkill? The website you're trying to get to has no way of knowing you're using Perl. Just tell it whatever it expects to hear.
Tools: Firefox with TamperData and LiveHTTPHeaders, Devel::REPL, LWP.
Analysis: In the browser, turn off JavaScript and Java, delete any cookies from the target web site, start TamperData logging, and log in to the web site. Stop TamperData logging and look back through the many requests you likely placed during the login process. Find the first request (the one you made on purpose) and look at its details.
Experimentation: Start re.pl, and start recreating the browser's interaction.
So that's step one. If you get mismatched responses at any point, you did something wrong. You can usually[1] find out what by looking at $r->request and comparing it with the request Firefox sent. The important thing is to remember that there is no magic and that you know everything the server knows. If you can't get the same response to what appears to be the same request, you missed something.
Getting to the first page is usually not enough. You'll likely need to parse forms (with HTML::Form), follow redirects (as configured above, UA does that automatically, but sometimes it pays to turn that off and do it by hand), and try to reverse engineer a weirdly-hacked-together login sequence from the barest of hints. Good luck.
[1]: Except in the case of certain bugs in LWP's cookies implementation that I won't detail here. And even then you can spot it if you know what you're looking for.
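For the form-parsing step, here is a small offline sketch with HTML::Form. The HTML and the field names are invented stand-ins for a real login page; in practice you would parse the page you actually fetched.

```perl
use strict;
use warnings;
use HTML::Form;

# Stand-in for the login page HTML; the field names here are made up.
my $html = <<'HTML';
<form method="post" action="/login">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="token" value="abc123">
  <input type="submit" name="go" value="Log in">
</form>
HTML

# The second argument is the base URI used to resolve the form's action.
my $form = HTML::Form->parse($html, 'https://account.web.site/');
$form->value(username => 'me');
$form->value(password => 'secret');

# click() builds the HTTP::Request a browser would send on submit,
# hidden fields included; nothing goes over the network here.
my $req = $form->click;
print $req->as_string;
```

The printed request is what you compare against the login request TamperData recorded; once they match, send it with $ua->request($req).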
Is your Perl script running on the same machine as the Firefox browser you reference? The site could be filtering based on subnet or incoming IP address. Your URL is https, so there could also be some PSK (pre-shared key) or certificate loaded in your browser that the server is expecting, though that is extremely unlikely outside of a company's internal intranet site.
I just noticed something.
This line:
It doesn't work on my machine at all. But I got it to work by changing it to:
adding the referrer portion made it work for me:
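As a hedged illustration only (this is not the poster's code, and both URLs are placeholders), a Referer header can be attached to an LWP request along these lines:

```perl
use strict;
use warnings;
use HTTP::Request;

# Placeholder URLs, not taken from the original post.
my $req = HTTP::Request->new(GET => 'https://account.web.site/report');

# Note the HTTP spelling of the header name: "Referer", with a single "r".
$req->header(Referer => 'https://account.web.site/');

print $req->as_string;
# To send it: LWP::UserAgent->new->request($req)
```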