Detecting shortened or "tiny" destination URLs
I have just scraped a bunch of Google Buzz data, and I want to know which Buzz posts reference the same news articles. The problem is that many of the links in these posts have been modified by URL shorteners, so it could be the case that many distinct shortened URLs actually all point to the same news article.
Given that I have millions of posts, what is the most efficient way (preferably in Python) for me to
- detect whether a URL is a shortened URL (from any of the many URL-shortening services, or at least the largest), and
- find the "destination" of the shortened URL, i.e., the long, original version of it?
Does anyone know if the URL shorteners impose strict request rate limits? If I keep this down to 100/second (all coming from the same IP address), do you think I'll run into trouble?
UPDATE & PRELIMINARY SOLUTION
The responses have led to the following simple solution:
import urllib2
response = urllib2.urlopen("http://bit.ly/AoifeMcL_ID3")  # some shortened URL
url_destination = response.url  # urlopen follows redirects, so this is the final URL
That's it!
5 Answers
The easiest way to get the destination of a shortened URL is with urllib. Given that the short URL is valid (response code 200), the URL will be returned to you. And that's that!
(AFAIK) Most URL shorteners keep track of URLs already shortened, so several requests to the same engine with the same URL will return the same short code.
As has been suggested, the best way to extract the real URL is to read the headers from the response to a request for the shortened URL. However, some shortening services (e.g. bit.ly) provide an API method to return the long URL.
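As a sketch of that API route: bit.ly's legacy v3 `expand` endpoint looked roughly like the call below. The endpoint URL, parameter names, and response layout here are assumptions based on that legacy API, and the function name is mine; check the service's current API documentation before relying on any of it.

```python
import json
import urllib.parse
import urllib.request

def expand_via_bitly(short_url, access_token):
    """Ask bit.ly's (legacy v3) expand API for the long URL behind a short one."""
    # Assumed legacy endpoint and parameter names -- verify against current docs.
    query = urllib.parse.urlencode({"access_token": access_token,
                                    "shortUrl": short_url})
    with urllib.request.urlopen("https://api-ssl.bitly.com/v3/expand?" + query) as resp:
        data = json.load(resp)
    # The long URL is assumed to live under data -> expand -> [0] -> long_url.
    return data["data"]["expand"][0]["long_url"]
```

An API call like this avoids fetching the destination page at all, which matters at millions of URLs, but it only covers one shortener per API.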
Make a list of the most-used URL shorteners, expand it as you discover new ones, then check whether a link matches an item on the list.
You do not know where the URL points unless you follow it, so the best way to do this is to follow the shortened URL and extract the HTTP headers of the response to see where it leads.
I guess with 100 requests per second you could well run into trouble (the worst that can happen is probably that they blacklist your IP as a spammer).
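A minimal sketch of that header-reading approach, using Python 3's `http.client` to issue a single HEAD request and read the `Location` header without downloading the page body (the function name is mine):

```python
import http.client
from urllib.parse import urlparse

def first_redirect(url):
    """Issue one HEAD request and return the Location header, or None."""
    parts = urlparse(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    conn.request("HEAD", parts.path or "/")
    response = conn.getresponse()
    location = response.getheader("Location")
    conn.close()
    return location  # None means the response was not a redirect
```

Note that shorteners sometimes chain redirects, so in practice you would loop until no `Location` header remains, with a hop limit to guard against redirect loops.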
The posted solution only works for Python 2.x; for Python 3.x you can do this to get the full URL.
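The snippet referred to is presumably the Python 3 counterpart of the solution above, where `urllib2.urlopen` becomes `urllib.request.urlopen`; a minimal sketch, wrapped in a small helper whose name is mine:

```python
import urllib.request

def expand_url(short_url):
    """Follow a shortened URL and return the final destination URL."""
    # urlopen follows redirects automatically, so response.url is the
    # long, original URL the short link points to.
    with urllib.request.urlopen(short_url) as response:
        return response.url
```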
From what I have read, these answers addressed the second question. I was interested in the first. After viewing a list of about 300 shorteners, it seems the best way to detect them is simply to put them into a list or regex and look for a match against any of them.
Then use r1 as a regex to match against whatever you are trying to find URL shorteners in (mail, etc.).
A very good list is here: longurl.org/services
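The list-plus-regex idea above can be sketched as follows; the handful of domains here is just an illustrative sample (in practice you would load the few hundred entries from a source such as the longurl.org/services list), and `is_shortened` is my own helper name:

```python
import re

# Tiny illustrative sample of shortener domains; load the full list in practice.
SHORTENER_DOMAINS = ["bit.ly", "t.co", "goo.gl", "tinyurl.com", "ow.ly", "is.gd"]

# Build one alternation regex (the "r1" mentioned above) matching any of them.
r1 = re.compile(
    r"https?://(?:" + "|".join(re.escape(d) for d in SHORTENER_DOMAINS) + r")/\S+"
)

def is_shortened(url):
    """Return True if the URL appears to come from a known shortener."""
    return r1.match(url) is not None
```

Matching against a fixed domain list is fast enough for millions of posts, but it only flags shorteners you already know about, so the list needs occasional updating.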