Canonical URL compare in Python?
Are there any tools to do a URL compare in Python?
For example, if I have http://google.com and google.com/, I'd like to know that they are likely to be the same site.
If I were to construct a rule manually, I might uppercase it, then strip off the http:// portion, and drop anything after the last alphanumeric character. But I can see failures of this approach, as I'm sure you can as well.
Is there a library that does this? How would you do it?
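A literal sketch of that manual rule (my own illustration, not from the question), just to show where it starts to break:

```python
import re

def naive_canonical(url):
    u = url.upper()                       # "uppercase it"
    u = u.removeprefix("HTTP://")         # "strip off the http:// portion"
    return re.sub(r"[^0-9A-Z]+$", "", u)  # "drop anything after the last alphanumeric"

print(naive_canonical("http://google.com") == naive_canonical("google.com/"))  # True
# ...but https://, 'www.' prefixes and tracking parameters all slip through.
```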
There is quite a bit to creating a canonical URL, apparently.
The url-normalize library is the best that I have tested.
Depending on the source of your URLs, you may wish to clean them of other standard parameters such as UTM codes; w3lib.url.url_query_cleaner is useful for this.
Combining this with Ned Batchelder's answer could look something like:
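Not the original snippet, just a minimal sketch of that combination, assuming the url-normalize and w3lib packages; the UTM parameter list and the prefix-stripping step are illustrative.

```python
from url_normalize import url_normalize
from w3lib.url import url_query_cleaner

def canonical_url(u):
    u = url_normalize(u)          # scheme, case, default ports, trailing slash, ...
    u = url_query_cleaner(        # ... then drop tracking parameters
        u,
        parameterlist=["utm_source", "utm_medium", "utm_campaign",
                       "utm_term", "utm_content"],
        remove=True,
    )
    # the prefix/suffix stripping idea from Ned Batchelder's answer
    for prefix in ("http://", "https://", "www."):
        if u.startswith(prefix):
            u = u[len(prefix):]
    return u.rstrip("/")

print(canonical_url("http://google.com"))           # google.com
print(canonical_url("google.com/?utm_source=x"))    # google.com
```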
This off the top of my head:
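Not the original snippet; a sketch in the same spirit, using only the startswith/endswith checks the answer goes on to mention.

```python
def canonical_url(u):
    u = u.lower()
    if u.startswith("http://"):
        u = u[len("http://"):]
    if u.startswith("www."):
        u = u[len("www."):]
    if u.endswith("/"):
        u = u[:-1]
    return u

def same_site(u1, u2):
    return canonical_url(u1) == canonical_url(u2)

print(same_site("http://google.com", "google.com/"))   # True
```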
Obviously, there's lots of room for more fiddling with this. Regexes might be better than startswith and endswith, but you get the idea.
You could look up the names using DNS and see if they point to the same IP. Some minor string processing may be required to remove confusing characters.
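A minimal sketch of that idea using only the standard-library socket module (it needs network access, and the helper names are mine):

```python
import socket

def bare_host(u):
    """Strip the scheme and any path so only the hostname remains."""
    u = u.split("://")[-1]    # drop 'http://' / 'https://' if present
    return u.split("/")[0]    # drop any path or trailing slash

def same_ip(u1, u2):
    return socket.gethostbyname(bare_host(u1)) == socket.gethostbyname(bare_host(u2))

print(same_ip("http://google.com", "google.com/"))
```

Note that large sites resolve to many rotating IPs, so two lookups for the same name can still disagree.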
The original answer from @Antony is good, but I didn't like that it uses two libraries (w3lib and url-normalize), and url-normalize hasn't been maintained since Jan 2022. w3lib is part of scrapy and is therefore better maintained and more stable, and in its documentation I found the following handy-dandy function:
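A usage sketch, assuming the function meant here is w3lib.url.canonicalize_url (that attribution is my guess from the w3lib documentation):

```python
from w3lib.url import canonicalize_url

# Sorts query arguments, normalizes percent-encoding, adds a trailing
# slash to an empty path and drops fragments by default.
print(canonicalize_url("http://www.example.com/do?b=2&a=1"))
# -> http://www.example.com/do?a=1&b=2
```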
I really value a stable solution, as opposed to one that relies on multiple odd libraries, since my use-case is to generate unique ids by hashing the canonicalised URL.
Btw, I think this function could be combined with the approach from @Ned Batchelder's answer.
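For the unique-id use case above, a minimal sketch (the choice of sha1 and the helper name are mine):

```python
import hashlib
from w3lib.url import canonicalize_url

def url_id(url: str) -> str:
    """Stable identifier for a URL: a hash of its canonical form."""
    return hashlib.sha1(canonicalize_url(url).encode("utf-8")).hexdigest()

print(url_id("http://www.example.com/do?b=2&a=1") ==
      url_id("http://www.example.com/do?a=1&b=2"))   # True
```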
It's not 'fuzzy'; it just finds the 'distance' between two strings:
http://pypi.python.org/pypi/python-Levenshtein/
I would remove all portions which are semantically meaningful to URL parsing (protocol, slashes, etc.), normalize to lowercase, then compute a Levenshtein distance, and from there decide how much difference is an acceptable threshold.
Just an idea.
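A rough sketch of that idea, assuming the python-Levenshtein package (the stripping regexes are my own):

```python
import re
import Levenshtein  # pip install python-Levenshtein

def stripped(u):
    """Remove the scheme, 'www.' and trailing slashes, and lowercase the rest."""
    u = re.sub(r"^[a-z]+://", "", u.lower())
    u = re.sub(r"^www\.", "", u)
    return u.rstrip("/")

# Distance 0 means "treat as the same site"; a larger threshold allows fuzzier matches.
print(Levenshtein.distance(stripped("http://google.com"), stripped("google.com/")))  # 0
```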