URL redirection problem
I have the short URL http://bit.ly/cDdh1c.
When you enter that URL in a browser, it redirects to the following URL:
http://www.kennystopproducts.info/Top/?hop=arnishad
However, when I try to find the base URL (the final URL after following all redirects) for the same URL, http://bit.ly/cDdh1c, with a Python program (code below), I get http://www.cbtrends.com/ as the base URL. Please see the log file below.
Why does the same URL behave differently in a browser and in a Python program? What should I change in the Python program so that it follows the redirects to the proper URL? I am wondering how this strange behaviour can happen.
Another URL for which I observe similar behaviour:
- http://bit.ly/bEKyOx ====> http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=150413977509 (via browser), http://www.ebay.com (via Python program)

    maxattempts = 5
    turl = url
    while (maxattempts > 0):
        host, path = urlparse.urlsplit(turl)[1:3]
        if len(host.strip()) == 0:
            return None
        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("HEAD", path)
            resp = connection.getresponse()
        except:
            return None
        maxattempts = maxattempts - 1
        if (resp.status >= 300) and (resp.status <= 399):
            self.logger.debug("The present %s is a redirection one" % turl)
            turl = resp.getheader('location')
        elif (resp.status >= 200) and (resp.status <= 299):
            self.logger.debug("The present url %s is a proper one" % turl)
            return turl
        else:
            # some problem with this url
            return None
    return None

Log file for your reference
2010-02-14 10:29:43,260 - paypallistener.views.MrCrawler - INFO - Bringing down the base URL
2010-02-14 10:29:43,261 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://bit.ly/cDdh1c
2010-02-14 10:29:43,994 - paypallistener.views.MrCrawler - DEBUG - The present http://bit.ly/cDdh1c is a redirection one
2010-02-14 10:29:43,995 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter
2010-02-14 10:29:44,606 - paypallistener.views.MrCrawler - DEBUG - The present http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter is a redirection one
2010-02-14 10:29:44,607 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/
2010-02-14 10:29:45,547 - paypallistener.views.MrCrawler - DEBUG - The present url http://www.cbtrends.com/ is a proper one
http://www.cbtrends.com/
Comments (2)
Your problem is that when you call urlsplit, your path variable only contains the path and is missing the query.
So, instead try:
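A minimal sketch of that idea, not necessarily the exact code the answer had in mind: rebuild the request target from both the path and the query. It assumes Python 3's `urllib.parse` (the question uses the Python 2 `urlparse` module), and the helper name `host_and_target` is made up for illustration.

```python
from urllib.parse import urlsplit


def host_and_target(url):
    """Split a URL into (host, request target), keeping the query string.

    `host_and_target` is a hypothetical helper name, not part of any library.
    """
    parts = urlsplit(url)
    # An empty path must be sent as "/" in the HTTP request line.
    target = parts.path or "/"
    if parts.query:
        # This is the piece the question's code drops.
        target += "?" + parts.query
    return parts.netloc, target
```

The returned target can then be passed as the second argument of `connection.request("HEAD", ...)` in place of the bare path.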
Your problem comes from this line:
    host,path = urlparse.urlsplit(turl)[1:3]
You're leaving out the query string. So in the example log you provided, the second HEAD request goes to http://www.cbtrends.com/get-product.html without the GET parameters. Open that URL in your browser and you'll see it redirects to http://www.cbtrends.com/. You have to calculate the path using all the elements of the tuple returned by urlsplit.
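Putting the pieces together, the redirect loop from the question could look roughly like the sketch below. It is written for Python 3 (`http.client` and `urllib.parse` instead of `httplib` and `urlparse`), and the names `head_request` and `resolve`, as well as the injectable `fetch` parameter, are assumptions introduced for illustration. It also resolves relative `Location` headers with `urljoin`, which the original loop does not attempt.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head_request(url, timeout=10):
    """Send one HEAD request and return (status, Location header or None)."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query  # keep the query string
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", target)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def resolve(url, fetch=head_request, max_attempts=5):
    """Follow redirects for at most max_attempts requests.

    Returns the final URL on a 2xx response, or None on any failure.
    `fetch` is a hypothetical hook so the loop can run without a network.
    """
    for _ in range(max_attempts):
        if not urlsplit(url).netloc:
            return None
        try:
            status, location = fetch(url)
        except (OSError, http.client.HTTPException):
            return None
        if 300 <= status <= 399:
            if not location:
                return None
            # The Location header may be relative to the current URL.
            url = urljoin(url, location)
        elif 200 <= status <= 299:
            return url
        else:
            return None  # some problem with this URL
    return None
```

Calling `resolve("http://bit.ly/cDdh1c")` would issue real HEAD requests; what it returns depends on how those servers respond today, so no output is shown here.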