How to find an ad's final destination URL (programmatically)

Posted on 2024-09-30 16:16:23

This may be trivial, or not, but I'm working on a piece of software that will verify the "end of the line" domain for ads displayed through my web application. Ideally, I have a list of domains I do not want to serve ads from (let's say Norton.com is one of them), but most ad networks serve ads via shortened and cryptic URLs (adsrv.com) that eventually redirect to Norton.com. So the question is: has anyone built, or has an idea of how to build, a scraper-like tool that will return the final destination URL of an ad?

Initial discovery: some ads are in Flash, JavaScript, or plain HTML. Emulating a browser is perfectly viable and would handle the different ad formats. Not all Flash or JS ads have a noflash or noscript alternative. (A browser may be necessary, but as stated that is perfectly fine, using something like WatiN, WatiR, WatiJ, Selenium, etc.)

I'd prefer open source so that I can rebuild it myself. I'd really appreciate any help!

EDIT: This script needs to click on the ad, since it might be Flash, JS, or just plain HTML. So cURL is less likely an option, unless cURL can click?

Comments (6)

许一世地老天荒 2024-10-07 16:16:23

Sample PHP Implementation:

$k = curl_init('http://goo.gl');
curl_setopt($k, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($k, CURLOPT_USERAGENT, 
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 ' .
'(KHTML, like Gecko) Chrome/7.0.517.41 Safari/534.7'); // imitate chrome
curl_setopt($k, CURLOPT_NOBODY, true); // HEAD request only (faster)
curl_setopt($k, CURLOPT_RETURNTRANSFER, true); // don't echo results
curl_exec($k);
$final_url = curl_getinfo($k, CURLINFO_EFFECTIVE_URL); // get last URL followed
curl_close($k);
echo $final_url;

Which should return something like
https://www.google.com/accounts/ServiceLogin?service=urlshortener&continue=http://goo.gl/?authed%3D1&followup=http://goo.gl/?authed%3D1&passive=true&go=true

Note: You might need to use curl_setopt() to turn off CURLOPT_SSL_VERIFYHOST and CURLOPT_SSL_VERIFYPEER if you want to reliably follow redirects across HTTPS/SSL.

你丑哭了我 2024-10-07 16:16:23
curl --head -L -s -o /dev/null -w %{url_effective} <some-short-url>
  • --head restricts it to HEAD requests only, so that you don't have to actually download the pages

  • -L tells curl to keep following redirects

  • -s gets rid of any progress meters, etc

  • -o /dev/null tells curl to throw away the headers retrieved (we don't care about them)

  • -w %{url_effective} tells curl to write out the last fetched url as the result to stdout

The result will be that the effective url is written to stdout, and nothing else.

雪若未夕 2024-10-07 16:16:23

You're talking about following the URL's redirects until it either times out, gets into a loop, or resolves to a final address.

The Net::HTTP library has a Following Redirection example.
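For reference, a minimal sketch of that kind of redirect-following with Net::HTTP might look like this (the final_url helper name, the 10-hop limit, and the goo.gl test URL are my own illustrations, not from that example):

require 'net/http'
require 'uri'

# Follow HTTP redirects recursively until a non-redirect response comes back.
def final_url(url, limit = 10)
  raise 'Too many redirects' if limit.zero?

  response = Net::HTTP.get_response(URI.parse(url))
  if response.is_a?(Net::HTTPRedirection)
    # Location may be relative, so resolve it against the current URL.
    final_url(URI.join(url, response['location']).to_s, limit - 1)
  else
    url
  end
end

puts final_url('http://goo.gl')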

Also, Ruby's open-uri module will automatically follow redirects, so I think you can ask it for the final URL after retrieving a page to find out where it landed.

require 'open-uri'

io = URI.open('http://google.com') # Kernel#open no longer handles URLs in Ruby 3+
body = io.read
io.base_uri.to_s # => "http://www.google.com/"

Notice that after reading the body the URL was redirected to Google's / dir.

Both cases will only handle server redirects. For meta-redirects you'll have to look at the code, see where they're redirecting you and go there.

This will get you started:

require 'nokogiri'

doc = Nokogiri::HTML('<meta http-equiv="REFRESH" content="0;url=http://www.the-domain-you-want-to-redirect-to.com">')

redirect_url = (doc%'meta[@http-equiv="REFRESH"]')['content'].split('=').last rescue nil
油焖大侠 2024-10-07 16:16:23

cURL can retrieve HTTP headers. Keep stepping through the chain until you're no longer getting Location: headers; the last Location: header you received is the final URL.
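To make that stepping concrete, here's a rough sketch of the same loop in Ruby using Net::HTTP HEAD requests; the starting ad URL and the 10-hop cap are purely illustrative:

require 'net/http'
require 'uri'

url = URI.parse('http://adsrv.com/some-ad') # hypothetical shortened ad URL
10.times do
  response = Net::HTTP.start(url.host, url.port, use_ssl: url.scheme == 'https') do |http|
    http.head(url.request_uri) # headers only, no body download
  end
  location = response['location']
  break unless location # no Location: header means we've reached the final URL

  url = URI.join(url.to_s, location) # resolve relative redirects against the current URL
end

puts url # the last URL in the chain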

爱的那么颓废 2024-10-07 16:16:23

The Mechanize gem is handy for this:

  agent = Mechanize.new {|a| a.user_agent_alias = 'Windows IE 7'}
  page = agent.get(url)
  final_url = page.uri.to_s
乱世争霸 2024-10-07 16:16:23

The solution I ended up using was simulating a browser, loading the ad, and clicking. The click was the key ingredient. Solutions offered by others were good for a given URL but would not handle Flash, JavaScript, etc. Appreciate everyone's help.
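For anyone landing here later, a bare-bones sketch of that browser-simulation approach using the Ruby selenium-webdriver bindings might look like the following; the page URL, iframe selector, and link lookup are hypothetical and will differ per ad network (a Flash ad would need a coordinate-based click rather than a link click):

require 'selenium-webdriver'
require 'uri'

driver = Selenium::WebDriver.for :chrome
driver.navigate.to 'http://example.com/page-with-ad' # hypothetical page that serves the ad

# Assume the ad is rendered inside an iframe and exposes a clickable link.
driver.switch_to.frame(driver.find_element(:css, 'iframe.ad-slot'))
driver.find_element(:tag_name, 'a').click

# Many ads open the landing page in a new window/tab, so follow the newest handle.
driver.switch_to.window(driver.window_handles.last)
final_domain = URI.parse(driver.current_url).host # compare this against the blocklist
puts final_domain

driver.quit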
