URL redirection problem

Posted on 2024-08-21 00:47:45

I have the following URL:

http://bit.ly/cDdh1c

When I paste it into a browser and press Enter, it redirects to:
http://www.kennystopproducts.info/Top/?hop=arnishad

However, when I try to find the base URL (the final URL after following all redirects) for the same link, http://bit.ly/cDdh1c, with a Python program (code below), I get http://www.cbtrends.com/ as the base URL. Please see the log file below.

Why does the same URL behave differently in a browser and in a Python program? What should I change in the Python program so that it ends up at the proper URL? I am wondering how this strange behaviour can happen.

Another URL for which I observe similar behaviour is http://bit.ly/bEKyOx:

  1. Via browser: http://bit.ly/bEKyOx ====>
    http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=150413977509
  2. Via Python program: http://bit.ly/bEKyOx ====>
    http://www.ebay.com

    maxattempts = 5
    turl = url
    while maxattempts > 0:
        host, path = urlparse.urlsplit(turl)[1:3]
        if len(host.strip()) == 0:
            return None
        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("HEAD", path)
            resp = connection.getresponse()
        except:
            return None
        maxattempts = maxattempts - 1
        if (resp.status >= 300) and (resp.status <= 399):
            self.logger.debug("The present %s is a redirection one" % turl)
            turl = resp.getheader('location')
        elif (resp.status >= 200) and (resp.status <= 299):
            self.logger.debug("The present url %s is a proper one" % turl)
            return turl
        else:
            # some problem with this url
            return None
    return None
    

Log file for your reference

2010-02-14 10:29:43,260 - paypallistener.views.MrCrawler - INFO - Bringing down the base URL
2010-02-14 10:29:43,261 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://bit.ly/cDdh1c
2010-02-14 10:29:43,994 - paypallistener.views.MrCrawler - DEBUG - The present http://bit.ly/cDdh1c is a redirection one
2010-02-14 10:29:43,995 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter
2010-02-14 10:29:44,606 - paypallistener.views.MrCrawler - DEBUG - The present http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter is a redirection one
2010-02-14 10:29:44,607 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/
2010-02-14 10:29:45,547 - paypallistener.views.MrCrawler - DEBUG - The present url http://www.cbtrends.com/ is a proper one
http://www.cbtrends.com/


Comments (2)

沙与沫 2024-08-28 00:47:46

Your problem is that when you call urlsplit, your path variable only contains the path and is missing the query.

So, instead try:

import httplib
import urlparse

def getUrl(url):
    """Follow HTTP redirects from url; return the final URL, or None on failure."""
    maxattempts = 10
    turl = url
    while maxattempts > 0:
        host, path, query = urlparse.urlsplit(turl)[1:4]
        if len(host.strip()) == 0:
            return None
        # Keep the query string; append '?' only when there is one.
        target = path or '/'
        if query:
            target += '?' + query
        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("GET", target)
            resp = connection.getresponse()
        except Exception:
            return None
        maxattempts = maxattempts - 1
        if 300 <= resp.status <= 399:
            # Redirect: continue with the URL from the Location header.
            turl = resp.getheader('location')
        elif 200 <= resp.status <= 299:
            return turl
        else:
            # some problem with this url
            return None
    return None

print getUrl('http://bit.ly/cDdh1c')
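
For readers on Python 3, where httplib became http.client and urlparse became urllib.parse, the same idea might look roughly like the sketch below. It is only an illustration, not the code from the question: the names request_target, fetch_status and resolve are made up here, and the extras (choosing HTTPS by scheme, resolving a relative Location with urljoin, passing the fetch step in as a parameter so the loop can be tried without a network) are assumptions that go beyond the original answer.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def request_target(url):
    """Return (host, target), where target is the path plus any query string."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.netloc, target


def fetch_status(url):
    """Send one HEAD request; return (status, Location header or None)."""
    host, target = request_target(url)
    if urlsplit(url).scheme == "https":
        connection = http.client.HTTPSConnection(host, timeout=10)
    else:
        connection = http.client.HTTPConnection(host, timeout=10)
    try:
        connection.request("HEAD", target)
        response = connection.getresponse()
        return response.status, response.getheader("Location")
    finally:
        connection.close()


def resolve(url, fetch=fetch_status, max_hops=10):
    """Follow redirects hop by hop; return the final URL, or None on failure."""
    for _ in range(max_hops):
        if not urlsplit(url).netloc:
            return None
        try:
            status, location = fetch(url)
        except (OSError, http.client.HTTPException):
            return None
        if 200 <= status <= 299:
            return url
        if 300 <= status <= 399 and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        else:
            return None
    return None
```

Calling resolve('http://bit.ly/cDdh1c') would then walk the chain one hop at a time. What it returns depends on what the servers answer today, so no output is shown here.
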
遗失的美好 2024-08-28 00:47:46

Your problem comes from this line:

host,path = urlparse.urlsplit(turl)[1:3]

You're leaving out the query string. So in the example log you provided, the second HEAD request goes to http://www.cbtrends.com/get-product.html without the GET parameters. Open that URL in your browser and you'll see it redirects to http://www.cbtrends.com/.

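You have to build the request path from both the path and the query string returned by urlsplit. The fragment (the part after #) is never sent to the server, so leave it out, and only add the ? when there is a query.

parts = urlparse.urlsplit(turl)
host = parts[1]
path = parts[2] or "/"
if parts[3]:
    path += "?" + parts[3]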
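
To see concretely which piece goes missing, here is a small Python 3 sketch (urllib.parse is the Python 3 home of urlparse) that splits the second URL from the question's log. The variable names are only for illustration.

```python
from urllib.parse import urlsplit

# The URL that bit.ly redirected to, copied from the log in the question.
url = ("http://www.cbtrends.com/get-product.html"
       "?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D"
       "&affid=arnishad&tid=arnishad"
       "&utm_source=twitterfeed&utm_medium=twitter")

parts = urlsplit(url)  # (scheme, netloc, path, query, fragment)

# The slice used in the question keeps only netloc and path.
host, path = parts[1:3]
print(host)  # www.cbtrends.com
print(path)  # /get-product.html

# The query string sits at index 3, so the slice above never sees it.
full_target = path + "?" + parts.query
print(full_target)
```

Requesting path alone asks the server for /get-product.html with no product id, which matches the redirect to the site's home page seen in the log.
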