Canonical URL comparison in Python?

Are there any tools to do a URL compare in Python?

For example, if I have http://google.com and google.com/ I'd like to know that they are likely to be the same site.

If I were to construct a rule manually, I might uppercase it, then strip off the http:// portion, and drop anything after the last alphanumeric character. But I can see failures of this approach, as I'm sure you can as well.

Is there a library that does this? How would you do it?

5 Answers

娇妻 2024-09-18 00:33:31

There is quite a bit to creating a canonical URL, apparently.
The url-normalize library is the best I have tested.

Depending on the source of your urls you may wish to clean them of other standard parameters such as UTM codes. w3lib.url.url_query_cleaner is useful for this.

Combining this with Ned Batchelder's answer could look something like:

Code:

from w3lib.url import url_query_cleaner
from url_normalize import url_normalize

urls = ['google.com',
        'google.com/',
        'http://google.com/',
        'http://google.com',
        'http://google.com?',
        'http://google.com/?',
        'http://google.com//',
        'http://google.com?utm_source=Google']


def canonical_url(u):
    # Normalize scheme, case, default port, percent-encodings, etc.
    u = url_normalize(u)
    # Strip common tracking parameters from the query string.
    u = url_query_cleaner(u,
                          parameterlist=['utm_source', 'utm_medium',
                                         'utm_campaign', 'utm_term',
                                         'utm_content'],
                          remove=True)

    # Drop the scheme, a leading "www." and a trailing slash so that
    # equivalent URLs compare equal as plain strings.
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("https://"):
        u = u[8:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u


list(map(canonical_url, urls))

Result:

['google.com',
 'google.com',
 'google.com',
 'google.com',
 'google.com',
 'google.com',
 'google.com',
 'google.com']

为人所爱 2024-09-18 00:33:31

This is off the top of my head:

def canonical_url(u):
    # Lowercase, then strip the scheme, a leading "www." and a
    # trailing slash so that equivalent URLs compare equal.
    u = u.lower()
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u

def same_urls(u1, u2):
    return canonical_url(u1) == canonical_url(u2)

Obviously, there's lots of room for more fiddling with this. Regexes might be better than startswith and endswith, but you get the idea.
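
For what it's worth, a minimal regex-based sketch might look like this; note that, as an assumption on my part, it also strips https:// and repeated trailing slashes, which the version above does not:

import re

def canonical_url(u):
    # Lowercase, strip an optional scheme and a leading "www.", and
    # trim trailing slashes; the same idea as the chained
    # startswith/endswith checks above.
    u = u.lower()
    u = re.sub(r'^https?://', '', u)
    u = re.sub(r'^www\.', '', u)
    return re.sub(r'/+$', '', u)

def same_urls(u1, u2):
    return canonical_url(u1) == canonical_url(u2)

same_urls('http://google.com', 'Google.com/')  # True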

撩心不撩汉 2024-09-18 00:33:31

You could look up the names using DNS and see if they point to the same IP. Some minor string processing may be required to remove confusing characters.

from socket import gethostbyname_ex

urls = ['http://google.com', 'google.com/', 'www.google.com/', 'news.google.com']

data = []
for original_name in urls:
    print('url:', original_name)
    # Reduce the URL to a bare host name before resolving it.
    name = original_name.strip()
    name = name.replace('http://', '')
    name = name.replace('http:', '')
    if name.find('/') > 0:
        name = name[:name.find('/')]
    if name.find('\\') > 0:
        name = name[:name.find('\\')]
    print('dns lookup:', name)
    if name:
        try:
            result = gethostbyname_ex(name)
        except OSError:
            continue  # unable to resolve
        for ip in result[2]:
            print('ip:', ip)
            data.append((ip, original_name))

print(data)

Result:

url: http://google.com
dns lookup: google.com
ip: 66.102.11.104
url: google.com/
dns lookup: google.com
ip: 66.102.11.104
url: www.google.com/
dns lookup: www.google.com
ip: 66.102.11.104
url: news.google.com
dns lookup: news.google.com
ip: 66.102.11.104
[('66.102.11.104', 'http://google.com'), ('66.102.11.104', 'google.com/'), ('66.102.11.104', 'www.google.com/'), ('66.102.11.104', 'news.google.com')]

茶色山野 2024-09-18 00:33:31

The original answer from @Antony is good, but I didn't like that it uses two libraries (w3lib and url-normalize), while url-normalize hasn't been maintained since Jan 2022.

w3lib is part of scrapy and thus better maintained and more stable, and in its docs I found the following handy-dandy function:

w3lib.url.canonicalize_url(url: Union[str, bytes, ParseResult], keep_blank_values: bool = True, keep_fragments: bool = False, encoding: Optional[str] = None) -> str

Canonicalize the given url by applying the following procedures:

  • make the URL safe
  • sort query arguments, first by key, then by value
  • normalize all spaces (in query arguments) to ‘+’ (plus symbol)
  • normalize percent encodings case (%2f -> %2F)
  • remove query arguments with blank values (unless keep_blank_values is True)
  • remove fragments (unless keep_fragments is True)

The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3).

Usage

import w3lib.url

# Sorting query arguments:
w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
# 'http://www.example.com/do?a=50&b=2&b=5&c=3'

# UTF-8 conversion + percent-encoding of non-ASCII characters:
w3lib.url.canonicalize_url('http://www.example.com/r\u00e9sum\u00e9')
# 'http://www.example.com/r%C3%A9sum%C3%A9'

Source

I really value a stable solution, as opposed to one that relies on multiple odd libraries, since my use-case is to generate unique ids by hashing the canonicalised URL.
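
For that hashing use case, a minimal sketch could be the following (the helper name url_id and the choice of sha1 are my own assumptions, not part of w3lib):

import hashlib

import w3lib.url

def url_id(url):
    # Equivalent URLs canonicalize to the same string and therefore
    # hash to the same stable id.
    canonical = w3lib.url.canonicalize_url(url)
    return hashlib.sha1(canonical.encode('utf-8')).hexdigest()

# Both orderings of the query string produce the same id:
url_id('http://www.example.com/do?c=3&b=5&b=2&a=50')
url_id('http://www.example.com/do?b=5&a=50&c=3&b=2')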

Btw, I think this function could be combined with

def canonical_url(u):
    u = u.lower()
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u

from @Ned Batchelder's answer.
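
A minimal sketch of that combination (the name canonical_key is mine, and stripping https:// is an addition to Ned's version):

import w3lib.url

def canonical_key(url):
    # w3lib handles query sorting, percent-encoding case, fragments,
    # etc.; the prefix/suffix stripping follows Ned Batchelder's answer.
    u = w3lib.url.canonicalize_url(url).lower()
    for prefix in ('http://', 'https://', 'www.'):
        if u.startswith(prefix):
            u = u[len(prefix):]
    if u.endswith('/'):
        u = u[:-1]
    return u

canonical_key('https://www.Example.com/do?b=2&a=1')  # 'example.com/do?a=1&b=2'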

花期渐远 2024-09-18 00:33:31

It's not 'fuzzy'; it just finds the 'distance' between two strings:

http://pypi.python.org/pypi/python-Levenshtein/

I would remove all portions which are semantically meaningful to URL parsing (the protocol, slashes, etc.), normalize to lowercase, then compute a Levenshtein distance, and from there decide how much difference is an acceptable threshold.

Just an idea.
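
A minimal sketch of that idea, assuming the python-Levenshtein package is installed; the stripping rules and the threshold value here are arbitrary:

import re

import Levenshtein  # pip install python-Levenshtein

def strip_url(u):
    # Remove the scheme, a leading "www." and trailing slashes
    # before comparing.
    u = u.lower()
    u = re.sub(r'^https?://', '', u)
    u = re.sub(r'^www\.', '', u)
    return u.rstrip('/')

def similar_urls(u1, u2, threshold=2):
    return Levenshtein.distance(strip_url(u1), strip_url(u2)) <= threshold

similar_urls('http://google.com', 'google.com/')  # True (distance 0)
similar_urls('google.com', 'goggle.com')          # also True; thresholds cut both ways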
