Detecting shortened or "tiny" destination URLs
I have just scraped a bunch of Google Buzz data, and I want to know which Buzz posts reference the same news articles. The problem is that many of the links in these posts have been modified by URL shorteners, so it could be the case that many distinct shortened URLs actually all point to the same news article.
Given that I have millions of posts, what is the most efficient way (preferably in Python) for me to
- detect whether a URL is a shortened URL (from any of the many URL-shortening services, or at least the largest), and
- find the "destination" of the shortened URL, i.e., the long, original version of it?
Does anyone know if the URL shorteners impose strict request rate limits? If I keep this down to 100/second (all coming from the same IP address), do you think I'll run into trouble?
UPDATE & PRELIMINARY SOLUTION
The responses have led to the following simple solution:
import urllib2
response = urllib2.urlopen("http://bit.ly/AoifeMcL_ID3")  # some shortened URL
url_destination = response.url  # urlopen follows redirects, so this is the final URL
That's it!
5 Answers
The easiest way to get the destination of a shortened URL is with urllib. Given that the short URL is valid (response code 200), the URL will be returned to you. And that's that!
(AFAIK) Most URL shorteners keep track of URLs already shortened, so several requests to the same engine with the same URL will return the same short code.
As has been suggested, the best way to extract the real URL is to read the headers from the response to a request for the shortened URL. However, some shortening services (e.g. bit.ly) provide an API method to return the long URL.
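As a sketch of that API route: bit.ly's legacy v3 `expand` endpoint looked roughly like the call below. The endpoint URL, parameter names, and response layout here are assumptions based on that legacy API, and the function name is mine; check the service's current API documentation before relying on any of it.

```python
import json
import urllib.parse
import urllib.request

def expand_via_bitly(short_url, access_token):
    """Ask bit.ly's (legacy v3) expand API for the long URL behind a short one."""
    # Assumed legacy endpoint and parameter names -- verify against current docs.
    query = urllib.parse.urlencode({"access_token": access_token,
                                    "shortUrl": short_url})
    with urllib.request.urlopen("https://api-ssl.bitly.com/v3/expand?" + query) as resp:
        data = json.load(resp)
    # The long URL is assumed to live under data -> expand -> [0] -> long_url.
    return data["data"]["expand"][0]["long_url"]
```

An API call like this avoids fetching the destination page at all, which matters at millions of URLs, but it only covers one shortener per API.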
Make a list of the most-used URL shorteners, expand it as you discover new ones, then check whether a link matches an item on the list.
You do not know where the URL points unless you follow it, so the best way to do this is to follow the shortened URL and extract the HTTP headers of the response to see where it leads.
I guess with 100 requests per second you could well run into trouble (the worst that can happen is probably that they blacklist your IP as a spammer).
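A minimal sketch of that header-reading approach, using Python 3's `http.client` to issue a single HEAD request and read the `Location` header without downloading the page body (the function name is mine):

```python
import http.client
from urllib.parse import urlparse

def first_redirect(url):
    """Issue one HEAD request and return the Location header, or None."""
    parts = urlparse(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    conn.request("HEAD", parts.path or "/")
    response = conn.getresponse()
    location = response.getheader("Location")
    conn.close()
    return location  # None means the response was not a redirect
```

Note that shorteners sometimes chain redirects, so in practice you would loop until no `Location` header remains, with a hop limit to guard against redirect loops.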
The posted solution only works for Python 2.x; for Python 3.x you can do this to get the full URL.
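The snippet referred to is presumably the Python 3 counterpart of the solution above, where `urllib2.urlopen` becomes `urllib.request.urlopen`; a minimal sketch, wrapped in a small helper whose name is mine:

```python
import urllib.request

def expand_url(short_url):
    """Follow a shortened URL and return the final destination URL."""
    # urlopen follows redirects automatically, so response.url is the
    # long, original URL the short link points to.
    with urllib.request.urlopen(short_url) as response:
        return response.url
```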
From what I have read, these answers addressed the second question. I was interested in the first. After viewing a list of about 300 shorteners, it seems the best way to detect them is simply to put them into a list or regex and look for a match against any of them.
Then use r1 as a regex to match against whatever you are trying to find URL shorteners in (mail, etc.).
A very good list is here: longurl.org/services
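The list-plus-regex idea above can be sketched as follows; the handful of domains here is just an illustrative sample (in practice you would load the few hundred entries from a source such as the longurl.org/services list), and `is_shortened` is my own helper name:

```python
import re

# Tiny illustrative sample of shortener domains; load the full list in practice.
SHORTENER_DOMAINS = ["bit.ly", "t.co", "goo.gl", "tinyurl.com", "ow.ly", "is.gd"]

# Build one alternation regex (the "r1" mentioned above) matching any of them.
r1 = re.compile(
    r"https?://(?:" + "|".join(re.escape(d) for d in SHORTENER_DOMAINS) + r")/\S+"
)

def is_shortened(url):
    """Return True if the URL appears to come from a known shortener."""
    return r1.match(url) is not None
```

Matching against a fixed domain list is fast enough for millions of posts, but it only flags shorteners you already know about, so the list needs occasional updating.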