How to find an ad's final destination URL (programmatically)

Posted on 2024-09-30 16:16:23

This may be trivial, or not, but I'm working on a piece of software that will verify the "end of the line" domain for ads displayed through my web application. Ideally, I have a list of domains I do not want to serve ads from (let's say Norton.com is one of them), but most ad networks serve ads via shortened and cryptic URLs (adsrv.com) that eventually redirect to Norton.com. So the question is: has anyone built, or has an idea of how to build, a scraper-like tool that will return the final destination URL of an ad?

Initial discovery: some ads are in Flash, JavaScript, or plain HTML. Emulating a browser is perfectly viable and would handle the different ad formats. Not all Flash or JS ads have a noflash or noscript alternative. (A browser may be necessary, but as stated that is perfectly fine, using something like WatiN, WatiR, WatiJ, Selenium, etc.)

I'd prefer open source so that I can rebuild it myself. I'd really appreciate any help!

EDIT: This script needs to click on the ad, since it might be Flash, JS, or just plain HTML. So cURL is less likely an option, unless cURL can click?

Comments (6)

许一世地老天荒 2024-10-07 16:16:23

Sample PHP Implementation:

$k = curl_init('http://goo.gl');
curl_setopt($k, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($k, CURLOPT_USERAGENT, 
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 ' .
'(KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7'); // imitate chrome
curl_setopt($k, CURLOPT_NOBODY, true); // HEAD request only (faster)
curl_setopt($k, CURLOPT_RETURNTRANSFER, true); // don't echo results
curl_exec($k);
$final_url = curl_getinfo($k, CURLINFO_EFFECTIVE_URL); // get last URL followed
curl_close($k);
echo $final_url;

Which should return something like
https://www.google.com/accounts/ServiceLogin?service=urlshortener&continue=http://goo.gl/?authed%3D1&followup=http://goo.gl/?authed%3D1&passive=true&go=true

Note: You might need to use curl_setopt() to turn off CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER if you want to reliably follow redirects across HTTPS/SSL.

你丑哭了我 2024-10-07 16:16:23
curl --head -L -s -o /dev/null -w %{url_effective} <some-short-url>
  • --head restricts it to HEAD requests only, so that you don't have to actually download the pages

  • -L tells curl to keep following redirects

  • -s gets rid of any progress meters, etc

  • -o /dev/null tells curl to throw away the headers retrieved (we don't care about them)

  • -w %{url_effective} tells curl to write out the last fetched url as the result to stdout

The result will be that the effective url is written to stdout, and nothing else.

雪若未夕 2024-10-07 16:16:23

You're talking about following the URL's redirects until it either times out, gets into a loop, or resolves to a final address.

The Net::HTTP library has a Following Redirection example.
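For reference, a minimal sketch of that kind of redirect-following with Net::HTTP might look like this (the final_url helper name, the 10-hop limit, and the goo.gl test URL are my own illustrations, not from that example):

require 'net/http'
require 'uri'

# Follow HTTP redirects recursively until a non-redirect response comes back.
def final_url(url, limit = 10)
  raise 'Too many redirects' if limit.zero?

  response = Net::HTTP.get_response(URI.parse(url))
  if response.is_a?(Net::HTTPRedirection)
    # Location may be relative, so resolve it against the current URL.
    final_url(URI.join(url, response['location']).to_s, limit - 1)
  else
    url
  end
end

puts final_url('http://goo.gl')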

Also, Ruby's open-uri module will automatically follow redirects, so I think you can ask it for the final URL after retrieving a page to find out where it landed.

require 'open-uri'

io = URI.open('http://google.com') # Kernel#open no longer handles URLs in Ruby 3+
body = io.read
io.base_uri.to_s # => "http://www.google.com/"

Notice that after reading the body the URL was redirected to Google's / dir.

Both cases will only handle server redirects. For meta-redirects you'll have to look at the code, see where they're redirecting you and go there.

This will get you started:

require 'nokogiri'

doc = Nokogiri::HTML('<meta http-equiv="REFRESH" content="0;url=http://www.the-domain-you-want-to-redirect-to.com">')

redirect_url = (doc%'meta[@http-equiv="REFRESH"]')['content'].split('=').last rescue nil
油焖大侠 2024-10-07 16:16:23

cURL can retrieve HTTP headers. Keep stepping through the chain until you're no longer getting Location: headers; the last Location: header you received is the final URL.
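To make that stepping concrete, here's a rough sketch of the same loop in Ruby using Net::HTTP HEAD requests; the starting ad URL and the 10-hop cap are purely illustrative:

require 'net/http'
require 'uri'

url = URI.parse('http://adsrv.com/some-ad') # hypothetical shortened ad URL
10.times do
  response = Net::HTTP.start(url.host, url.port, use_ssl: url.scheme == 'https') do |http|
    http.head(url.request_uri) # headers only, no body download
  end
  location = response['location']
  break unless location # no Location: header means we've reached the final URL

  url = URI.join(url.to_s, location) # resolve relative redirects against the current URL
end

puts url # the last URL in the chain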

爱的那么颓废 2024-10-07 16:16:23

The Mechanize gem is handy for this:

  agent = Mechanize.new {|a| a.user_agent_alias = 'Windows IE 7'}
  page = agent.get(url)
  final_url = page.uri.to_s
乱世争霸 2024-10-07 16:16:23

The solution I ended up using was simulating a browser, loading the ad, and clicking. The click was the key ingredient. Solutions offered by others were good for a given URL but would not handle Flash, JavaScript, etc. Appreciate everyone's help.
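For anyone landing here later, a bare-bones sketch of that browser-simulation approach using the Ruby selenium-webdriver bindings might look like the following; the page URL, iframe selector, and link lookup are hypothetical and will differ per ad network (a Flash ad would need a coordinate-based click rather than a link click):

require 'selenium-webdriver'
require 'uri'

driver = Selenium::WebDriver.for :chrome
driver.navigate.to 'http://example.com/page-with-ad' # hypothetical page that serves the ad

# Assume the ad is rendered inside an iframe and exposes a clickable link.
driver.switch_to.frame(driver.find_element(:css, 'iframe.ad-slot'))
driver.find_element(:tag_name, 'a').click

# Many ads open the landing page in a new window/tab, so follow the newest handle.
driver.switch_to.window(driver.window_handles.last)
final_domain = URI.parse(driver.current_url).host # compare this against the blocklist
puts final_domain

driver.quit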
