In Python, how do I check whether two different links actually point to the same page?

Posted on 2024-11-13 06:04:29

For example, these two links point to the same place:

http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html

http://www.independent.co.uk/life-style/gadgets-and-tech/news/2292113.html

How can I check this in Python?


Comments (2)

抠脚大汉 2024-11-20 06:04:29

Call geturl() on the result of urllib2.urlopen(). geturl() "returns the URL of the resource retrieved, commonly used to determine if a redirect was followed."

For example:

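#!/usr/bin/env python
# coding: utf-8
# Note: urllib2 exists only in Python 2; in Python 3, urlopen lives in urllib.request.

import urllib2

url1 = 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html'
url2 = 'http://www.independent.co.uk/life-style/gadgets-and-tech/news/2292113.html'

# Open each URL and print the address of the page that was finally served,
# i.e. the URL after any redirects have been followed.
for url in [url1, url2]:
    result = urllib2.urlopen(url)
    print result.geturl()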

The output is:

http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html
http://www.independent.co.uk/life-style/gadgets-and-tech/news/chinese-blamed-for-gmail-hacking-2292113.html
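
For readers on Python 3, here is a minimal sketch of the same idea, assuming `urllib.request.urlopen` as the replacement for `urllib2.urlopen`. The helper names `final_url` and `same_page` and the `resolve` parameter are made up for illustration and are not part of the original answer. Note that this only detects links that redirect to the same address; two distinct URLs serving identical content would still compare as different.

```python
from urllib.request import urlopen


def final_url(url, timeout=10):
    """Return the URL actually retrieved after any HTTP redirects."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def same_page(url1, url2, resolve=final_url):
    """Return True if both URLs resolve to the same final URL.

    `resolve` is a hypothetical hook: it defaults to a real network
    lookup, but any callable mapping a URL to its final URL will do,
    which makes the comparison easy to try without network access.
    """
    return resolve(url1) == resolve(url2)
```

Calling `same_page(url1, url2)` with the two links from the question performs two real HTTP requests, so it needs network access and can raise `urllib.error.URLError` if a site is unreachable.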

三生路 2024-11-20 06:04:29

It's impossible to discern that merely from the URLs, obviously.

You could fetch the content and compare it, but then I imagine you'd have to use a smart criterion to decide when two pages are the same. For example, both pages might point to the same article while the ads differ at random, or the related articles change depending on other factors.

Design your program in such a way that the criterion for matching pages is easily replaced, even dynamically, and try until you find one that doesn't fail -- for example, for a newspaper page, you could try finding headlines.
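
One way the replaceable-criterion idea might look in Python 3 is sketched below. Everything here is an assumption for illustration: the names `pages_match`, `same_title` and `similar_text` are invented, the page `<title>` stands in for "headline", and the 0.9 similarity threshold is arbitrary. The functions work on HTML strings that have already been fetched.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the page title with whitespace normalised ('' if none)."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title.split())


def same_title(html1, html2):
    """Criterion: both pages have the same non-empty <title>."""
    title1 = extract_title(html1)
    return bool(title1) and title1 == extract_title(html2)


def similar_text(html1, html2, threshold=0.9):
    """Criterion: the raw markup is at least `threshold` similar."""
    return SequenceMatcher(None, html1, html2).ratio() >= threshold


def pages_match(html1, html2, criterion=same_title):
    """Compare two pages using a swappable criterion function."""
    return criterion(html1, html2)
```

Swapping the rule is then a matter of passing a different function, for example `pages_match(a, b, criterion=similar_text)`, so you can experiment until one holds up for the sites you care about.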
