Given a large list of URLs, what is the best data mining approach for grouping the URLs into patterns or regular expressions?

Posted on 2024-12-01 09:43:49

I've got a list of 1 million URLs and I'd like to cluster similar URLs together. The output of the process would be a list of regular expressions or patterns. Ideally I'd like to use Ruby to derive the data. My initial thoughts flow toward using a Machine Learning classifier, but I'm not sure where to start or what data mining technique to use.

Possible example:

Input:

http://www.example.com/folder-A/file.html
http://www.example.com/folder-A/dude.html
http://www.example.com/folder-B/huh.html
http://www.example.com/folder-C/what-ever.html

Output:

http://www\.example\.com/folder-A/[a-z]+\.html
http://www\.example\.com/folder-[A-C]/[-a-z]+\.html

Comments (4)

尤怨 2024-12-08 09:43:49

This program:

#!/usr/bin/env perl

use strict;
use warnings;

# the following is a CPAN module requiring independent installation:
use Regexp::Assemble;

my @url_list = qw(
    http://www.example.com/folder-A/file.html
    http://www.example.com/folder-A/dude.html
    http://www.example.com/folder-B/huh.html
    http://www.example.com/folder-C/what-ever.html
);

my $asm = Regexp::Assemble->new;
for my $url (@url_list) {
    $asm->add($url);
}

my $pat = $asm->re;
for ($pat) {
    s/^.*?://;
    s/\)$//;
}

print "$pat\n";

when run, duly prints out:

http://www.example.com/folder-(?:A/(?:dud|fil)e|C/what-ever|B/huh).html

Is that what you were looking for?
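Since the question asks for Ruby, here is a rough Ruby counterpart, offered as a sketch rather than an equivalent of Regexp::Assemble. It relies only on the standard `Regexp.union`, which escapes each string and joins them into a flat alternation; unlike the Perl module, it does not factor out shared prefixes or suffixes. The variable names are illustrative.

```ruby
urls = %w[
  http://www.example.com/folder-A/file.html
  http://www.example.com/folder-A/dude.html
  http://www.example.com/folder-B/huh.html
  http://www.example.com/folder-C/what-ever.html
]

# Regexp.union escapes regex metacharacters in each string (so "." only
# matches a literal dot) and joins the results with "|".
alternation = Regexp.union(urls).source

# Anchor the alternation so that only whole URLs match.
pattern = Regexp.new("\\A(?:#{alternation})\\z")

puts pattern.source
puts urls.all? { |u| pattern.match?(u) }
```

For a million URLs a flat alternation becomes very large, so this is mainly useful once the list has already been split into small groups.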

不离久伴 2024-12-08 09:43:49

You can use the automaton library at http://www.brics.dk/automaton/ to build the union (an "or" operation) of several strings and then minimize the resulting automaton; that gives you a single generalized regular expression.

A simpler solution is to use prefix optimization to extract the shared leading part of the strings. For this, look at the example at http://code.google.com/p/graph-expression/wiki/RegexpOptimization.

Unfortunately, all of this is written for Java, but the generated regexp can of course be used in any regular expression engine.
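To illustrate the prefix optimization idea in Ruby, here is a minimal sketch. It is not the code from either linked project: `build_trie` and `trie_to_regex` are hypothetical helpers that store the strings in a character trie and then serialize the trie so that each shared prefix is written only once. Shared suffixes such as `.html` are not merged.

```ruby
# Build a character trie; the :end key marks where a string terminates.
def build_trie(strings)
  trie = {}
  strings.each do |s|
    node = trie
    s.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end
  trie
end

# Serialize the trie as regex source with shared prefixes factored out.
def trie_to_regex(node)
  branches = node.reject { |k, _| k == :end }.map do |ch, child|
    Regexp.escape(ch) + trie_to_regex(child)
  end
  return "" if branches.empty?

  optional = node.key?(:end) # a shorter string also ends at this node
  if branches.size == 1 && !optional
    branches.first
  else
    group = "(?:#{branches.join('|')})"
    optional ? "#{group}?" : group
  end
end

urls = %w[
  http://www.example.com/folder-A/file.html
  http://www.example.com/folder-A/dude.html
  http://www.example.com/folder-B/huh.html
  http://www.example.com/folder-C/what-ever.html
]

source = trie_to_regex(build_trie(urls))
puts source
```

The result matches exactly the input strings, so it compresses the list but does not generalize to unseen URLs.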

层林尽染 2024-12-08 09:43:49

If you are asking how you should parse a URL with a regular expression then take a look at the IETF's RFC 2396.

RFC 2396                   URI Generic Syntax                August 1998

B. Parsing a URI Reference with a Regular Expression

As described in Section 4.3, the generic URI syntax is not
sufficient to disambiguate the components of some forms of URI.
Since the "greedy algorithm" described in that section is identical
to the disambiguation method used by POSIX regular expressions, it
is natural and commonplace to use a regular expression for parsing
the potential four components and fragment identifier of a URI
reference.

The following line is the regular expression for breaking-down a
URI reference into its components.

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist
readability; they indicate the reference points for each
subexpression (i.e., each paired parenthesis). We refer to the
value matched for subexpression <n> as $<n>. For example, matching
the above expression to

  http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

  $1 = http:
  $2 = http
  $3 = //www.ics.uci.edu
  $4 = www.ics.uci.edu
  $5 = /pub/ietf/uri/
  $6 = <undefined>
  $7 = <undefined>
  $8 = #Related
  $9 = Related

where <undefined> indicates that the component is not present, as
is the case for the query component in the above example.
Therefore, we can determine the value of the four components and
fragment as

  scheme    = $2
  authority = $4
  path      = $5
  query     = $7
  fragment  = $9

and, going in the opposite direction, we can recreate a URI
reference from its components using the algorithm in step 7 of
Section 5.2.

From there you should be able to compare the fragments of the URL and identify patterns.
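As a rough illustration of that last step, the sketch below uses the RFC's regular expression to split each URL, groups URLs that share scheme, authority and directory, and emits one pattern per group. Everything beyond the RFC regex is an assumption made for the example: `split_uri`, `char_class` and `patterns_for` are hypothetical helpers, each URL is assumed to be absolute and to end in a file name, and files in one directory are assumed to share an extension.

```ruby
# Regular expression from RFC 2396, Appendix B.
URI_RE = Regexp.new('^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

# Split a URI reference into the components named in the RFC.
def split_uri(uri)
  m = URI_RE.match(uri)
  { scheme: m[2], authority: m[4], path: m[5], query: m[7], fragment: m[9] }
end

# Hypothetical helper: a character class covering every character in names.
def char_class(names)
  chars = names.join.chars.uniq
  return "" if chars.empty?

  parts = []
  parts << "-"   if chars.include?("-") # a leading hyphen is literal in a class
  parts << "a-z" if chars.any? { |c| c.match?(/[a-z]/) }
  parts << "A-Z" if chars.any? { |c| c.match?(/[A-Z]/) }
  parts << "0-9" if chars.any? { |c| c.match?(/[0-9]/) }
  others = chars.reject { |c| c.match?(/[-a-zA-Z0-9]/) }
  parts << others.map { |c| Regexp.escape(c) }.join
  "[#{parts.join}]+"
end

# Hypothetical helper: one pattern per (scheme, authority, directory) group.
def patterns_for(urls)
  groups = urls.group_by do |u|
    c = split_uri(u)
    [c[:scheme], c[:authority], c[:path].rpartition("/").first]
  end
  groups.map do |(scheme, authority, dir), members|
    files = members.map { |u| split_uri(u)[:path].rpartition("/").last }
    ext   = File.extname(files.first)
    names = files.map { |f| File.basename(f, ext) }
    prefix = Regexp.escape("#{scheme}://#{authority}#{dir}/")
    "#{prefix}#{char_class(names)}#{Regexp.escape(ext)}"
  end
end

urls = %w[
  http://www.example.com/folder-A/file.html
  http://www.example.com/folder-A/dude.html
  http://www.example.com/folder-B/huh.html
  http://www.example.com/folder-C/what-ever.html
]
puts patterns_for(urls)
```

Note that `Regexp.escape` also escapes the hyphen, so the literal part reads `folder\-A`. Merging sibling directories into something like `folder-[A-C]` would need a further pass over the directory names and is not attempted here.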

童话 2024-12-08 09:43:49

Your question is a bit vague, but it sounds like something you could do with a map/reduce type setup. Partition your data in smaller chunks, group each chunk by "root" (whatever you mean by that, I assume "authority" or maybe "scheme" + "authority") and then merge the groups in the reduce stage.
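A single-process sketch of that idea in Ruby might look like the following. It only imitates the shape of a map/reduce job and is not tied to any particular framework: `map_chunk`, `reduce_groups` and `group_by_root` are hypothetical names, the "root" is assumed to be scheme plus host, and the chunk size is arbitrary. `URI.parse` raises on malformed input, so real data would need cleaning first.

```ruby
require "uri"

# Map step: group one chunk of URLs by root (scheme + host, an assumption).
def map_chunk(urls)
  urls.group_by do |u|
    uri = URI.parse(u)
    "#{uri.scheme}://#{uri.host}"
  end
end

# Reduce step: merge the per-chunk groups into one hash keyed by root.
def reduce_groups(partials)
  partials.each_with_object(Hash.new { |h, k| h[k] = [] }) do |partial, merged|
    partial.each { |root, urls| merged[root].concat(urls) }
  end
end

# Partition the list into chunks, map each chunk, then reduce.
def group_by_root(urls, chunk_size: 2)
  reduce_groups(urls.each_slice(chunk_size).map { |chunk| map_chunk(chunk) })
end

urls = %w[
  http://www.example.com/folder-A/file.html
  http://www.example.org/index.html
  http://www.example.com/folder-B/huh.html
  http://www.example.com/folder-C/what-ever.html
  http://www.example.org/about.html
]

group_by_root(urls).each do |root, members|
  puts "#{root}: #{members.size} URLs"
end
```

In a real distributed setup each chunk would be mapped on a different worker; the merge step stays the same because the groups are keyed by root.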
