Given a large list of URLs, what is the best data mining method for grouping the URLs into patterns or regular expressions?
I've got a list of 1 million URLs and I'd like to cluster similar URLs together. The output of the process would be a list of regular expressions or patterns. Ideally I'd like to use Ruby to derive the data. My initial thoughts flow toward using a Machine Learning classifier, but I'm not sure where to start or what data mining technique to use.
Possible example:
Input:
http://www.example.com/folder-A/file.html
http://www.example.com/folder-A/dude.html
http://www.example.com/folder-B/huh.html
http://www.example.com/folder-C/what-ever.html
Output:
http://www\.example\.com/folder-A/[a-z]+\.html
http://www\.example\.com/folder-[A-C]/[-a-z]+\.html
Comments (4)
This program:
when run, duly prints out:
Is that what you were looking for?
Hi, you can use this automaton library (http://www.brics.dk/automaton/) to build the "or" (union) of several strings and then optimize the automaton; in that case you end up with a single generalized regular expression.
A simpler solution is to use prefix optimization to factor out the shared leading part; for that, look at this example: http://code.google.com/p/graph-expression/wiki/RegexpOptimization.
Unfortunately, all of this is written for Java, but of course the generated regexp can be used in any regular expression engine.
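Since the asker wanted Ruby, here is a minimal sketch of the prefix-optimization idea, assuming a plain nested-hash trie. It is not a port of either Java library linked above, and `build_trie` and `trie_to_regex` are hypothetical names made up for this illustration. The resulting expression matches exactly the input strings, with shared prefixes written only once.

```ruby
# Build a nested-hash trie: one level per character.
# The symbol :end is a hypothetical marker meaning "a string stops here".
def build_trie(strings)
  trie = {}
  strings.each do |s|
    node = trie
    s.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end
  trie
end

# Turn the trie back into regex source, factoring out shared prefixes.
def trie_to_regex(node)
  terminal = node.key?(:end)
  branches = node.reject { |key, _| key == :end }
                 .map { |ch, child| Regexp.escape(ch) + trie_to_regex(child) }
  return "" if branches.empty?

  # A lone, non-terminal branch needs no grouping; otherwise alternate.
  body = branches.length == 1 && !terminal ? branches.first : "(?:#{branches.join('|')})"
  # If a string may also end at this node, the continuation is optional.
  terminal ? "#{body}?" : body
end

urls = %w[
  http://www.example.com/folder-A/file.html
  http://www.example.com/folder-A/dude.html
  http://www.example.com/folder-B/huh.html
  http://www.example.com/folder-C/what-ever.html
]

source  = trie_to_regex(build_trie(urls))
pattern = Regexp.new("\\A#{source}\\z")
puts source
```

This only compresses the list; it does not generalize to URLs it has not seen, so it is a starting point rather than a clustering method.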
If you are asking how to parse a URL with a regular expression, take a look at the IETF's RFC 2396 (since superseded by RFC 3986).
From there you should be able to compare the components of each URL and identify patterns.
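As a rough sketch of what "compare the pieces and identify patterns" could look like in Ruby, the snippet below uses the standard `uri` library rather than a hand-written regex. It assumes that URLs with the same scheme, host, and number of path segments belong together, which is a simplification chosen for this example. The helper names (`common_prefix`, `common_suffix`, `char_class`, `generalize`, `url_patterns`) are hypothetical.

```ruby
require "uri"

# Longest prefix shared by all strings (compare the smallest and largest).
def common_prefix(strings)
  lo, hi = strings.minmax
  i = 0
  i += 1 while i < lo.length && lo[i] == hi[i]
  lo[0, i]
end

def common_suffix(strings)
  common_prefix(strings.map(&:reverse)).reverse
end

# Coarse character class covering every character seen in the strings.
def char_class(strings)
  chars  = strings.join.chars.uniq
  ranges = { "a-z" => /[a-z]/, "A-Z" => /[A-Z]/, "0-9" => /[0-9]/ }
             .select { |_, re| chars.any? { |c| c.match?(re) } }.keys
  others = chars.grep_v(/[a-zA-Z0-9]/).sort.map { |c| Regexp.escape(c) }
  "[#{others.join}#{ranges.join}]"
end

# Keep what all values share; replace the varying middle with a class.
def generalize(values)
  values = values.uniq
  return Regexp.escape(values.first) if values.length == 1

  prefix  = common_prefix(values)
  rests   = values.map { |v| v[prefix.length..-1] }
  suffix  = common_suffix(rests)
  middles = rests.map { |r| r[0, r.length - suffix.length] }
  quantifier = middles.any?(&:empty?) ? "*" : "+"
  Regexp.escape(prefix) + char_class(middles) + quantifier + Regexp.escape(suffix)
end

# Assumption: same root and same path depth means "same cluster".
def url_patterns(urls)
  parsed = urls.map do |u|
    uri = URI.parse(u) # raises URI::InvalidURIError on malformed input
    ["#{uri.scheme}://#{uri.host}", uri.path.split("/").reject(&:empty?)]
  end
  parsed.group_by { |root, segs| [root, segs.length] }.map do |(root, count), group|
    segments = (0...count).map { |i| generalize(group.map { |_, segs| segs[i] }) }
    ([Regexp.escape(root)] + segments).join("/")
  end
end

urls = %w[
  http://www.example.com/folder-A/file.html
  http://www.example.com/folder-A/dude.html
  http://www.example.com/folder-B/huh.html
  http://www.example.com/folder-C/what-ever.html
]
patterns = url_patterns(urls)
puts patterns
```

The character classes are deliberately coarse (whole ranges such as `A-Z` rather than `A-C`), and query strings are ignored; a real list of a million URLs would likely need a smarter grouping key than path depth.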
Your question is a bit vague, but it sounds like something you could do with a map/reduce type setup. Partition your data into smaller chunks, group each chunk by "root" (whatever you mean by that, I assume "authority" or maybe "scheme" + "authority") and then merge the groups in the reduce stage.
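To make the shape of that concrete, here is a single-process sketch in Ruby. It assumes "root" means scheme plus authority, as guessed above; the chunk size, the extra `other.example.org` URL, and the names `root_of`, `map_chunk`, and `reduce_groups` are all made up for illustration. In a real setup the map step would run on separate workers.

```ruby
require "uri"

# "Root" here is scheme + authority (host, plus port when non-default).
def root_of(url)
  uri  = URI.parse(url)
  port = uri.port == uri.default_port ? "" : ":#{uri.port}"
  "#{uri.scheme}://#{uri.host}#{port}"
end

# Map step: group one chunk of URLs by root.
def map_chunk(chunk)
  chunk.group_by { |url| root_of(url) }
end

# Reduce step: merge the per-chunk groups, concatenating lists per root.
def reduce_groups(partials)
  partials.reduce({}) do |merged, partial|
    merged.merge(partial) { |_root, left, right| left + right }
  end
end

urls = %w[
  http://www.example.com/folder-A/file.html
  http://www.example.com/folder-A/dude.html
  http://other.example.org/index.html
  http://www.example.com/folder-B/huh.html
  http://www.example.com/folder-C/what-ever.html
]

chunk_size = 2 # tiny on purpose, so the merge step has something to do
partials   = urls.each_slice(chunk_size).map { |chunk| map_chunk(chunk) }
groups     = reduce_groups(partials)

groups.each { |root, members| puts "#{root}: #{members.length} URLs" }
```

Each resulting group can then be handed to whatever pattern-derivation step you choose, independently of the other groups.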