URL path similarity / string similarity algorithms

Published 2024-12-10 07:48:39


My problem is that I need to compare URL paths and deduce if they are similar. Below I provide example data to process:

# GROUP 1
/robots.txt

# GROUP 2
/bot.html

# GROUP 3
/phpMyAdmin-2.5.6-rc1/scripts/setup.php
/phpMyAdmin-2.5.6-rc2/scripts/setup.php
/phpMyAdmin-2.5.6/scripts/setup.php
/phpMyAdmin-2.5.7-pl1/scripts/setup.php
/phpMyAdmin-2.5.7/scripts/setup.php
/phpMyAdmin-2.6.0-alpha/scripts/setup.php
/phpMyAdmin-2.6.0-alpha2/scripts/setup.php

# GROUP 4
//phpMyAdmin/

I tried comparing with Levenshtein distance, but it is not accurate enough for me. I do not need a 100% accurate algorithm, but I think 90% and above is a must.

I think that I need some sort of classifier, but the problem is that each portion of new data can contain paths that should be classified into a new, unknown class.

Could you please point me toward the right idea?

Thanks


Comments (3)

╄→承喏 2024-12-17 07:48:40


I know it's not the exact answer to your question, but are you familiar with the k-means algorithm?

I guess even Levenshtein distance can work here; the difficulty, however, is how to compute centroids with that approach.

Perhaps you can divide the input set into disjoint subsets, then for each URL in each subset compute the distance to all the other URLs in the same subset; the URL with the lowest sum of distances should be the centroid (of course, this depends on how big the input set is; for huge sets it might not be a good idea to do so).

The good thing about k-means is that you can start with an absolutely random division and then iteratively make it better.

The bad thing about k-means is that you have to fix k before you start. However, during the run (perhaps once the situation has stabilized after the first couple of iterations), you can measure the intra-similarity of each set, and if it is low, you can divide the set into two subsets and go on with the same algorithm.
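As a rough illustration of the centroid idea above, one could pick each subset's medoid: the URL whose summed Levenshtein distance to the rest of the subset is smallest. A minimal sketch (plain dynamic-programming edit distance; the function names are mine, not from any library):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def medoid(urls):
    """The URL with the lowest total distance to every URL in its subset."""
    return min(urls, key=lambda u: sum(levenshtein(u, v) for v in urls))
```

For huge subsets this is quadratic in the number of URLs, which is exactly the caveat mentioned above.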

羁〃客ぐ 2024-12-17 07:48:39


Levenshtein distance is the best option, but the distance needs tuning. You have to use a weighted edit distance and possibly split the path into tokens (words and numbers). So, for example, versions like "2.5.6-rc2" and "2.5.6" can be treated as a 0-weight difference, but name tokens like phpMyAdmin and javaMyAdmin give a 1-weight difference.
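One possible sketch of this token-level weighted edit distance is shown below; the tokenizer, the `is_versionish` heuristic, and the cost values are my own assumptions for illustration, not a standard scheme:

```python
import re

def tokenize(path):
    # Split a URL path into word and number tokens, dropping separators.
    return re.findall(r"[A-Za-z]+|\d+", path)

def is_versionish(tok):
    # Pure numbers and common pre-release tags count as "version" tokens.
    return tok.isdigit() or tok.lower() in {"rc", "alpha", "beta", "pl"}

def sub_cost(a, b):
    if a == b:
        return 0.0
    if is_versionish(a) and is_versionish(b):
        return 0.0  # version differences carry 0 weight
    return 1.0      # differing name tokens carry full weight

def weighted_distance(p1, p2):
    # Standard edit-distance DP over tokens, with the custom substitution cost.
    t1, t2 = tokenize(p1), tokenize(p2)
    prev = [float(j) for j in range(len(t2) + 1)]
    for i, a in enumerate(t1, 1):
        cur = [float(i)]
        for j, b in enumerate(t2, 1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + sub_cost(a, b)))
        prev = cur
    return prev[-1]
```

With this weighting, the phpMyAdmin paths from GROUP 3 that differ only in version tokens come out at distance 0, while a change in a name token still costs 1.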

素手挽清风 2024-12-17 07:48:39


While checking @jakub.gieryluk's suggestion, I accidentally found a solution that satisfies me: the "Hobohm clustering algorithm, originally devised to reduce redundancy of biological sequence data sets."

Tests of a Perl library implemented by Bruno Vecchi gave me really good results. The only problem is that I need a Python implementation, but I believe I can either find one on the Internet or reimplement the code myself.

The next thing is that I have not checked the active-learning ability of this algorithm yet ;)
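For reference, the greedy "algorithm 1" variant of Hobohm clustering is short to sketch in Python. Here I use difflib's normalized similarity ratio as a stand-in for the sequence identity used on biological data, and the 0.8 threshold is an arbitrary assumption, not a value from the original algorithm:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Normalized similarity in [0, 1]; stands in for sequence identity.
    return SequenceMatcher(None, a, b).ratio()

def hobohm1(paths, threshold=0.8):
    """Greedy Hobohm 'algorithm 1': scan the input once and keep a path
    as a new cluster representative only if it is not too similar to any
    representative selected so far."""
    reps = []
    clusters = {}
    for p in paths:
        for r in reps:
            if similarity(p, r) >= threshold:
                clusters[r].append(p)   # redundant: joins an existing cluster
                break
        else:
            reps.append(p)              # novel: becomes a new representative
            clusters[p] = [p]
    return clusters
```

On the example data, the two phpMyAdmin variants fall into one cluster while /robots.txt and /bot.html each stay on their own; note that the result depends on input order and on the chosen threshold.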
