Given a large list of URLs, what is the best data-mining method for grouping them into patterns or regular expressions?

Posted 2024-12-01 09:43:49

I've got a list of 1 million URLs and I'd like to cluster similar URLs together. The output of the process would be a list of regular expressions or patterns. Ideally I'd like to use Ruby to derive the data. My initial thoughts flow toward using a Machine Learning classifier, but I'm not sure where to start or what data mining technique to use.

Possible example:

Input:

http://www.example.com/folder-A/file.html
http://www.example.com/folder-A/dude.html
http://www.example.com/folder-B/huh.html
http://www.example.com/folder-C/what-ever.html

Output:

http://www\.example\.com/folder-A/[a-z]+\.html
http://www\.example\.com/folder-[A-C]/[-a-z]+\.html

尤怨 2024-12-08 09:43:49

This program:

#!/usr/bin/env perl

use strict;
use warnings;

# the following is a CPAN module requiring independent installation:
use Regexp::Assemble;

my @url_list = qw(
    http://www.example.com/folder-A/file.html
    http://www.example.com/folder-A/dude.html
    http://www.example.com/folder-B/huh.html
    http://www.example.com/folder-C/what-ever.html
);

my $asm = Regexp::Assemble->new;
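for my $url (@url_list) {
    # add() parses its argument as a pattern, so "." stays a wildcard;
    # pass quotemeta($url) instead if the dots must match literally.
    $asm->add($url);
}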

my $pat = $asm->re;
for ($pat) {
    s/^.*?://;
    s/\)$//;
}

print "$pat\n";

when run, duly prints out:

http://www.example.com/folder-(?:A/(?:dud|fil)e|C/what-ever|B/huh).html

Is that what you were looking for?
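
Since the question asks for Ruby, here is a minimal sketch that checks what the assembled pattern does and does not cover. The pattern string is copied from the output above; the two extra test URLs are made up for illustration. It only exercises the result and does not reproduce Regexp::Assemble itself.

```ruby
# Pattern copied verbatim from the program output above.
pattern_source = 'http://www.example.com/folder-(?:A/(?:dud|fil)e|C/what-ever|B/huh).html'

# Anchor the pattern so that it has to match the whole URL.
assembled = Regexp.new('\A' + pattern_source + '\z')

urls = %w[
  http://www.example.com/folder-A/file.html
  http://www.example.com/folder-A/dude.html
  http://www.example.com/folder-B/huh.html
  http://www.example.com/folder-C/what-ever.html
]

# Every input URL is matched by the assembled pattern.
all_match = urls.all? { |url| assembled.match?(url) }
puts "all inputs match: #{all_match}"

# The dots are unescaped, so they behave as wildcards (hypothetical URL).
loose = assembled.match?('http://wwwXexample.com/folder-B/huhXhtml')
puts "altered URL still matches: #{loose}"

# The pattern is an exact alternation: an unseen file name is rejected.
unseen = assembled.match?('http://www.example.com/folder-A/other.html')
puts "unseen file matches: #{unseen}"
```

In other words, the result is a compact union of the exact inputs rather than a generalized pattern such as a character class per path segment.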

不离久伴 2024-12-08 09:43:49

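Hi, you can use this automaton library (http://www.brics.dk/automaton/) to build an automaton from your strings and then optimize it; in that case you end up with a single general regular expression.

A simpler solution is prefix optimization, which factors out the shared leading parts of the strings. See this example: http://code.google.com/p/graph-expression/wiki/RegexpOptimization.

Unfortunately, all of this is written for Java, but the generated regular expression can of course be used with any regex engine.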
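
The prefix-optimization idea can be sketched in Ruby with a character trie. This is an illustrative sketch only, not the code of either library mentioned above; `build_trie` and `trie_to_source` are hypothetical helper names. It factors out shared prefixes but, unlike a minimized automaton, does nothing about shared suffixes such as ".html".

```ruby
# Build a character trie from the strings; the :end key marks a complete string.
def build_trie(strings)
  trie = {}
  strings.each do |string|
    node = trie
    string.each_char { |char| node = (node[char] ||= {}) }
    node[:end] = true
  end
  trie
end

# Convert a trie node into regex source, emitting each shared prefix once.
def trie_to_source(node)
  children = node.reject { |key, _| key == :end }
  branches = children.map { |char, child| Regexp.escape(char) + trie_to_source(child) }
  return '' if branches.empty?

  body = branches.join('|')
  # Group when there is a choice, or when the remainder is optional.
  body = "(?:#{body})" if branches.size > 1 || node.key?(:end)
  node.key?(:end) ? body + '?' : body
end

urls = %w[
  http://www.example.com/folder-A/file.html
  http://www.example.com/folder-A/dude.html
  http://www.example.com/folder-B/huh.html
  http://www.example.com/folder-C/what-ever.html
]

source = trie_to_source(build_trie(urls))
pattern = Regexp.new('\A' + source + '\z')

puts source
puts urls.all? { |url| pattern.match?(url) }
```

Because every character goes through `Regexp.escape`, the dots in the host name match literally here. For a million URLs the trie can become large, so it may be worth grouping by host first and building one trie per group.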

层林尽染 2024-12-08 09:43:49

If you are asking how to parse a URL with a regular expression, take a look at the IETF's RFC 2396.

RFC 2396                   URI Generic Syntax                August 1998

B. Parsing a URI Reference with a Regular Expression

As described in Section 4.3, the generic URI syntax is not
sufficient to disambiguate the components of some forms of URI.
Since the "greedy algorithm" described in that section is identical
to the disambiguation method used by POSIX regular expressions, it
is natural and commonplace to use a regular expression for parsing
the potential four components and fragment identifier of a URI
reference.

The following line is the regular expression for breaking-down a
URI reference into its components.

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist
readability; they indicate the reference points for each
subexpression (i.e., each paired parenthesis). We refer to the
value matched for subexpression <n> as $<n>. For example, matching
the above expression to

  http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

  $1 = http:
  $2 = http
  $3 = //www.ics.uci.edu
  $4 = www.ics.uci.edu
  $5 = /pub/ietf/uri/
  $6 = <undefined>
  $7 = <undefined>
  $8 = #Related
  $9 = Related

where <undefined> indicates that the component is not present, as
is the case for the query component in the above example.
Therefore, we can determine the value of the four components and
fragment as

  scheme    = $2
  authority = $4
  path      = $5
  query     = $7
  fragment  = $9

and, going in the opposite direction, we can recreate a URI
reference from its components using the algorithm in step 7 of
Section 5.2.

From there you should be able to compare the fragments of the URLs and identify patterns.
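
As a rough illustration of splitting URLs with the RFC 2396 expression and then comparing the pieces, the Ruby sketch below groups URLs by scheme, authority, and directory. The grouping key is an assumption made for this example, and `split_uri` and `group_key` are hypothetical helper names; other keys (for instance path depth or file extension) may suit real data better.

```ruby
# Regular expression from RFC 2396, Appendix B.
URI_PARTS = Regexp.new('^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

# Split a URI reference into the five components named in the RFC.
def split_uri(uri)
  match = URI_PARTS.match(uri)
  {
    scheme:    match[2],
    authority: match[4],
    path:      match[5],
    query:     match[7],
    fragment:  match[9]
  }
end

# Assumed grouping key: scheme, authority, and the path minus its last segment.
def group_key(uri)
  parts = split_uri(uri)
  directory = parts[:path].sub(%r{[^/]*\z}, '')
  [parts[:scheme], parts[:authority], directory]
end

urls = %w[
  http://www.example.com/folder-A/file.html
  http://www.example.com/folder-A/dude.html
  http://www.example.com/folder-B/huh.html
  http://www.example.com/folder-C/what-ever.html
]

groups = urls.group_by { |url| group_key(url) }

groups.each do |(scheme, authority, directory), members|
  puts "#{scheme}://#{authority}#{directory} (#{members.size} URLs)"
end
```

Each group can then be generalized on its own, for example by deriving a character class from the file names it contains.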