URL redirection problem

Posted on 2024-08-21 00:47:45


I have the following URL:

http://bit.ly/cDdh1c

When you paste this URL into a browser and hit Enter, it redirects to:
http://www.kennystopproducts.info/Top/?hop=arnishad

However, when I try to find the base URL (the final URL after following all redirects) for the same http://bit.ly/cDdh1c via a Python program (code below), I get http://www.cbtrends.com/ as the base URL. Please see the log file below.

Why does the same URL behave differently in a browser and in a Python program? What should I change in the Python program so that it follows the redirects to the proper URL? I am wondering how this strange behaviour can happen.

Another URL for which I observe similar behaviour:

http://bit.ly/bEKyOx ====> http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=150413977509 (via browser)

http://www.ebay.com (via Python program)

    maxattempts = 5
    turl = url
    while maxattempts > 0:
        host, path = urlparse.urlsplit(turl)[1:3]
        if len(host.strip()) == 0:
            return None

        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("HEAD", path)
            resp = connection.getresponse()
        except:
            return None
        maxattempts = maxattempts - 1
        if (resp.status >= 300) and (resp.status <= 399):
            self.logger.debug("The present %s is a redirection one" % turl)
            turl = resp.getheader('location')
        elif (resp.status >= 200) and (resp.status <= 299):
            self.logger.debug("The present url %s is a proper one" % turl)
            return turl
        else:
            # some problem with this url
            return None
    return None

Log file for your reference

2010-02-14 10:29:43,260 - paypallistener.views.MrCrawler - INFO - Bringing down the base URL
2010-02-14 10:29:43,261 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://bit.ly/cDdh1c
2010-02-14 10:29:43,994 - paypallistener.views.MrCrawler - DEBUG - The present http://bit.ly/cDdh1c is a redirection one
2010-02-14 10:29:43,995 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter
2010-02-14 10:29:44,606 - paypallistener.views.MrCrawler - DEBUG - The present http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter is a redirection one
2010-02-14 10:29:44,607 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/
2010-02-14 10:29:45,547 - paypallistener.views.MrCrawler - DEBUG - The present url http://www.cbtrends.com/ is a proper one
http://www.cbtrends.com/


沙与沫 2024-08-28 00:47:46


Your problem is that when you call urlsplit, your path variable only contains the path and is missing the query.

So, instead try:

import httplib
import urlparse

def getUrl(url):
    maxattempts = 10
    turl = url
    while maxattempts > 0:
        host, path, query = urlparse.urlsplit(turl)[1:4]
        if len(host.strip()) == 0:
            return None
        # Send the query string as well; without it the server sees a different URL.
        target = path or '/'
        if query:
            target += '?' + query
        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("GET", target)
            resp = connection.getresponse()
        except Exception:
            return None
        maxattempts = maxattempts - 1
        if 300 <= resp.status <= 399:
            # Follow the redirect on the next iteration.
            turl = resp.getheader('location')
            if turl is None:
                return None
        elif 200 <= resp.status <= 299:
            return turl
        else:
            # some problem with this url
            return None
    return None

print getUrl('http://bit.ly/cDdh1c')
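
For readers on Python 3, where httplib and urlparse became http.client and urllib.parse, the same idea can be sketched as follows. This is only a sketch under a few assumptions: the names request_target, http_fetch and resolve are made up for illustration, only plain http:// URLs are handled, and the fetch parameter exists solely so the redirect logic can be exercised without touching the network.

```python
import http.client
from urllib.parse import urlsplit, urljoin


def request_target(url):
    """Return (host, target), where target is the path plus any query string."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.netloc, target


def http_fetch(url):
    """Issue one GET without following redirects; return (status, Location)."""
    host, target = request_target(url)
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("GET", target)
        resp = conn.getresponse()
        return resp.status, resp.getheader("location")
    finally:
        conn.close()


def resolve(url, fetch=http_fetch, max_hops=10):
    """Follow redirects by hand; return the final URL, or None on failure."""
    for _ in range(max_hops):
        if not urlsplit(url).netloc:
            return None
        try:
            status, location = fetch(url)
        except (OSError, http.client.HTTPException):
            return None
        if 300 <= status <= 399 and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        elif 200 <= status <= 299:
            return url
        else:
            # Error status, or a redirect without a Location header.
            return None
    return None
```

Calling resolve('http://bit.ly/cDdh1c') would then walk the chain with the query string intact at every hop. What it returns depends on what the servers answer at the time.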


遗失的美好 2024-08-28 00:47:46


Your problem comes from this line:

host,path = urlparse.urlsplit(turl)[1:3]

You're leaving out the query string. So in the example log you provided, the second HEAD request goes to http://www.cbtrends.com/get-product.html without the GET parameters. Open that URL in your browser and you'll see it redirects to http://www.cbtrends.com/.

You have to build the request path from both the path and the query elements of the tuple returned by urlsplit. The fragment does not need to be included, since it is not sent to the server.

parts = urlparse.urlsplit(turl)
host = parts[1]
# Path plus query string; add the '?' only when there is a query.
path = parts[2] or '/'
if parts[3]:
    path += '?' + parts[3]
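
To see exactly what the slice in the question throws away, here is a small illustration. It assumes Python 3 (urllib.parse instead of urlparse), and the URL is a shortened form of the one in the log with a made-up #top fragment added purely to show where each piece ends up.

```python
from urllib.parse import urlsplit

# Shortened, illustrative URL; the "#top" fragment is hypothetical.
url = "http://www.cbtrends.com/get-product.html?affid=arnishad&tid=arnishad#top"
parts = urlsplit(url)

# SplitResult fields, in order: scheme, netloc, path, query, fragment
scheme, netloc, path, query, fragment = parts

# The slice used in the question keeps only netloc and path.
host_only, path_only = parts[1:3]

# What belongs on the request line: path plus query, without the fragment.
target = path + ("?" + query if query else "")

print(host_only, path_only)
print(target)
```

The second printed line is the request target a browser would send for this URL. The first shows that the question's code asks for the bare path, which is why the server answers with a different redirect.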


About the author: 海未深
