Canonical URL comparison in Python?

Are there any tools to do a URL compare in Python?

For example, if I have http://google.com and google.com/ I'd like to know that they are likely to be the same site.

If I were to construct a rule manually, I might uppercase it, then strip off the http:// portion, and drop anything after the last alphanumeric character. But I can see failures of this approach, as I'm sure you can as well.

Is there a library that does this? How would you do it?

5 Answers

娇妻 2024-09-18 00:33:31

There is quite a bit to creating a canonical URL, apparently.
The url-normalize library is the best I have tested.

Depending on the source of your urls you may wish to clean them of other standard parameters such as UTM codes. w3lib.url.url_query_cleaner is useful for this.

Combining this with Ned Batchelder's answer could look something like:

Code:

from w3lib.url import url_query_cleaner
from url_normalize import url_normalize

urls = ['google.com',
        'google.com/',
        'http://google.com/',
        'http://google.com',
        'http://google.com?',
        'http://google.com/?',
        'http://google.com//',
        'http://google.com?utm_source=Google']


def canonical_url(u):
    # Normalize scheme, case, default port, percent-encodings, etc.
    u = url_normalize(u)
    # Strip common tracking parameters from the query string.
    u = url_query_cleaner(u,
                          parameterlist=['utm_source', 'utm_medium',
                                         'utm_campaign', 'utm_term',
                                         'utm_content'],
                          remove=True)

    # Drop the scheme, a leading "www." and a trailing slash so that
    # equivalent URLs compare equal as plain strings.
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("https://"):
        u = u[8:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u


list(map(canonical_url, urls))

Result:

['google.com',
 'google.com',
 'google.com',
 'google.com',
 'google.com',
 'google.com',
 'google.com',
 'google.com']

为人所爱 2024-09-18 00:33:31

This is off the top of my head:

def canonical_url(u):
    # Lowercase, then strip the scheme, a leading "www." and a
    # trailing slash so that equivalent URLs compare equal.
    u = u.lower()
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u

def same_urls(u1, u2):
    return canonical_url(u1) == canonical_url(u2)

Obviously, there's lots of room for more fiddling with this. Regexes might be better than startswith and endswith, but you get the idea.
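
For what it's worth, a minimal regex-based sketch might look like this; note that, as an assumption on my part, it also strips https:// and repeated trailing slashes, which the version above does not:

import re

def canonical_url(u):
    # Lowercase, strip an optional scheme and a leading "www.", and
    # trim trailing slashes; the same idea as the chained
    # startswith/endswith checks above.
    u = u.lower()
    u = re.sub(r'^https?://', '', u)
    u = re.sub(r'^www\.', '', u)
    return re.sub(r'/+$', '', u)

def same_urls(u1, u2):
    return canonical_url(u1) == canonical_url(u2)

same_urls('http://google.com', 'Google.com/')  # True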

撩心不撩汉 2024-09-18 00:33:31

You could look up the names using DNS and see if they point to the same IP. Some minor string processing may be required to remove confusing characters.

from socket import gethostbyname_ex

urls = ['http://google.com', 'google.com/', 'www.google.com/', 'news.google.com']

data = []
for original_name in urls:
    print('url:', original_name)
    # Reduce the URL to a bare host name before resolving it.
    name = original_name.strip()
    name = name.replace('http://', '')
    name = name.replace('http:', '')
    if name.find('/') > 0:
        name = name[:name.find('/')]
    if name.find('\\') > 0:
        name = name[:name.find('\\')]
    print('dns lookup:', name)
    if name:
        try:
            result = gethostbyname_ex(name)
        except OSError:
            continue  # unable to resolve
        for ip in result[2]:
            print('ip:', ip)
            data.append((ip, original_name))

print(data)

Result:

url: http://google.com
dns lookup: google.com
ip: 66.102.11.104
url: google.com/
dns lookup: google.com
ip: 66.102.11.104
url: www.google.com/
dns lookup: www.google.com
ip: 66.102.11.104
url: news.google.com
dns lookup: news.google.com
ip: 66.102.11.104
[('66.102.11.104', 'http://google.com'), ('66.102.11.104', 'google.com/'), ('66.102.11.104', 'www.google.com/'), ('66.102.11.104', 'news.google.com')]

茶色山野 2024-09-18 00:33:31

The original answer from @Antony is good, but I didn't like that it uses two libraries (w3lib and url-normalize), while url-normalize hasn't been maintained since Jan 2022.

w3lib is part of scrapy and thus better maintained and more stable, and in its docs I found the following handy-dandy function:

w3lib.url.canonicalize_url(url: Union[str, bytes, ParseResult], keep_blank_values: bool = True, keep_fragments: bool = False, encoding: Optional[str] = None) -> str

Canonicalize the given url by applying the following procedures:

  • make the URL safe
  • sort query arguments, first by key, then by value
  • normalize all spaces (in query arguments) to ‘+’ (plus symbol)
  • normalize percent encodings case (%2f -> %2F)
  • remove query arguments with blank values (unless keep_blank_values is True)
  • remove fragments (unless keep_fragments is True)

The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3).

Usage

import w3lib.url

# Sorting query arguments:
w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
# 'http://www.example.com/do?a=50&b=2&b=5&c=3'

# UTF-8 conversion + percent-encoding of non-ASCII characters:
w3lib.url.canonicalize_url('http://www.example.com/r\u00e9sum\u00e9')
# 'http://www.example.com/r%C3%A9sum%C3%A9'

Source

I really value a stable solution, as opposed to one that relies on multiple odd libraries, since my use-case is to generate unique ids by hashing the canonicalised URL.
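
For that hashing use case, a minimal sketch could be the following (the helper name url_id and the choice of sha1 are my own assumptions, not part of w3lib):

import hashlib

import w3lib.url

def url_id(url):
    # Equivalent URLs canonicalize to the same string and therefore
    # hash to the same stable id.
    canonical = w3lib.url.canonicalize_url(url)
    return hashlib.sha1(canonical.encode('utf-8')).hexdigest()

# Both orderings of the query string produce the same id:
url_id('http://www.example.com/do?c=3&b=5&b=2&a=50')
url_id('http://www.example.com/do?b=5&a=50&c=3&b=2')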

Btw, I think this function could be combined with

def canonical_url(u):
    u = u.lower()
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u

from @Ned Batchelder's answer.
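
A minimal sketch of that combination (the name canonical_key is mine, and stripping https:// is an addition to Ned's version):

import w3lib.url

def canonical_key(url):
    # w3lib handles query sorting, percent-encoding case, fragments,
    # etc.; the prefix/suffix stripping follows Ned Batchelder's answer.
    u = w3lib.url.canonicalize_url(url).lower()
    for prefix in ('http://', 'https://', 'www.'):
        if u.startswith(prefix):
            u = u[len(prefix):]
    if u.endswith('/'):
        u = u[:-1]
    return u

canonical_key('https://www.Example.com/do?b=2&a=1')  # 'example.com/do?a=1&b=2'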

花期渐远 2024-09-18 00:33:31

It's not 'fuzzy'; it just finds the 'distance' between two strings:

http://pypi.python.org/pypi/python-Levenshtein/

I would remove all portions which are semantically meaningful to URL parsing (the protocol, slashes, etc.), normalize to lowercase, then compute a Levenshtein distance, and from there decide how much difference is an acceptable threshold.

Just an idea.
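
A minimal sketch of that idea, assuming the python-Levenshtein package is installed; the stripping rules and the threshold value here are arbitrary:

import re

import Levenshtein  # pip install python-Levenshtein

def strip_url(u):
    # Remove the scheme, a leading "www." and trailing slashes
    # before comparing.
    u = u.lower()
    u = re.sub(r'^https?://', '', u)
    u = re.sub(r'^www\.', '', u)
    return u.rstrip('/')

def similar_urls(u1, u2, threshold=2):
    return Levenshtein.distance(strip_url(u1), strip_url(u2)) <= threshold

similar_urls('http://google.com', 'google.com/')  # True (distance 0)
similar_urls('google.com', 'goggle.com')          # also True; thresholds cut both ways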
