How to find the final destination (URL) of an ad, programmatically
This may or may not be trivial, but I'm working on a piece of software that will verify the "end of the line" domain for ads displayed through my web application. Ideally, I have a list of domains I do not want to serve ads from (let's say Norton.com is one of them), but most ad networks serve ads via shortened, cryptic URLs (adsrv.com) that eventually redirect to Norton.com. So the question is: has anyone built, or does anyone have an idea of how to build, a scraper-like tool that will return the final destination URL of an ad?
Initial discovery: Some ads are in Flash, JavaScript, or plain HTML. Emulating a browser is perfectly viable and would cope with the different ad formats. Not all Flash or JS ads have a noflash or noscript alternative. (A browser may be necessary, but as stated this is perfectly fine, using something like WatiN or WatiR or WatiJ or Selenium.)
I'd prefer open source so that I could rebuild one myself. I really appreciate the help!
EDIT: This script needs to click on the ad, since it might be Flash, JS, or just plain HTML. So curl is less likely to be an option, unless curl can click?
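Once a final URL is known, the check against the list of unwanted domains described above could look roughly like the following. This is a minimal sketch: `BLOCKED_DOMAINS` and `blocked_destination?` are hypothetical names, and the list holds only the Norton.com example from the question.

```ruby
require 'uri'

# Hypothetical blocklist; the question only names Norton.com as an example.
BLOCKED_DOMAINS = ['norton.com'].freeze

# Returns true when the URL's host is a blocked domain or a subdomain of one.
# Unparseable URLs and URLs without a host are treated as not blocked.
def blocked_destination?(url, blocked = BLOCKED_DOMAINS)
  host = URI.parse(url).host
  return false if host.nil?

  host = host.downcase
  blocked.any? { |domain| host == domain || host.end_with?(".#{domain}") }
rescue URI::InvalidURIError
  false
end
```

Matching on the parsed host, rather than searching the whole URL string, keeps a tracking URL that merely carries "norton.com" in its query string from being flagged.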
Comments (6)
Sample PHP Implementation:
Which should return something like
https://www.google.com/accounts/ServiceLogin?service=urlshortener&continue=http://goo.gl/?authed%3D1&followup=http://goo.gl/?authed%3D1&passive=true&go=true
Note: You might need to use curl_setopt() to turn off CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER if you want to reliably follow redirects across HTTPS/SSL.
--head restricts it to HEAD requests only, so that you don't have to actually download the pages
-L tells curl to keep following redirects
-s gets rid of any progress meters, etc.
-o /dev/null tells curl to throw away the headers retrieved (we don't care about them)
-w %{url_effective} tells curl to write out the last fetched URL as the result to stdout
The result will be that the effective URL is written to stdout, and nothing else.
You're talking about following the redirection of the URL until it either times out, gets into a loop, or resolves to a final address.
The Net::HTTP library has a Following Redirection example.
Also, Ruby's open-uri module will automatically follow redirects, so I think you can ask it for the ending URL after you retrieve a page and find out where it landed.
Notice that after reading the body the URL was redirected to Google's / dir.
Both cases will only handle server redirects. For meta-redirects you'll have to look at the code, see where they're redirecting you, and go there.
This will get you started:
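For the meta-redirect case mentioned above, one possible approach (a rough sketch, not the answerer's code) is to pull the target out of a `<meta http-equiv="refresh">` tag. The name `meta_refresh_target` is hypothetical, and the regular expressions are a simplification that assumes a quoted `content` attribute; a real HTML parser would be more robust.

```ruby
require 'uri'

# Matches a <meta http-equiv="refresh" ...> tag in raw HTML.
META_REFRESH = /<meta[^>]+http-equiv\s*=\s*["']?refresh["']?[^>]*>/i
# Captures the URL from a quoted content attribute such as "0; url=/landing".
REFRESH_URL = /content\s*=\s*["'][^"'>]*?url\s*=\s*([^"'>\s]+)/i

# Returns the absolute meta-refresh target found in `html`, resolved against
# `base_url`, or nil when the page has no meta refresh.
def meta_refresh_target(html, base_url)
  tag = html[META_REFRESH]
  return nil if tag.nil?

  match = REFRESH_URL.match(tag)
  return nil if match.nil?

  URI.join(base_url, match[1]).to_s
end
```

The returned URL can then be fed back into whatever follows the server-side redirects, so both kinds of hop end up in the same chain.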
cURL can retrieve HTTP headers. Keep stepping through the chain until you're no longer getting Location: headers; the last Location: header you received is the final URL.
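The same stepping idea can be sketched in Ruby with Net::HTTP instead of cURL. This is an illustration under assumptions, not a definitive implementation: `next_location` and `final_url` are hypothetical names, the 5-second timeouts and the 10-hop limit are arbitrary, and some servers answer HEAD requests differently from GET.

```ruby
require 'net/http'
require 'uri'

# Sends one HEAD request and returns the absolute URL from the Location
# header, or nil when the response is not a redirect.
def next_location(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 5) do |http|
    http.head(uri.request_uri)
  end
  return nil unless response.is_a?(Net::HTTPRedirection)

  location = response['location']
  location && URI.join(url, location).to_s
end

# Walks the redirect chain from start_url. `lookup` is any callable that maps
# a URL to the next URL (or nil); it defaults to a real HEAD request but can
# be replaced with a stub. Raises on a loop or on more than max_hops redirects.
def final_url(start_url, max_hops: 10, lookup: method(:next_location))
  seen = [start_url]
  current = start_url
  (max_hops + 1).times do
    target = lookup.call(current)
    return current if target.nil?
    raise "redirect loop at #{target}" if seen.include?(target)

    seen << target
    current = target
  end
  raise "more than #{max_hops} redirects"
end
```

Keeping the per-hop lookup separate from the loop makes it possible to check the loop and limit handling without touching the network. Like the other header-based approaches here, it only sees server redirects, not clicks on Flash or JavaScript ads.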
The Mechanize gem is handy for this:
The solution I ended up using was simulating a browser, loading the ad, and clicking. The click was the key ingredient. Solutions offered by others were good for a given URL but would not handle Flash, JavaScript, etc. I appreciate everyone's help.