Extracting words from domain names
I have a bunch of domains that I would like to split into words. I downloaded a word list from wordlist.sourceforge.net and started writing a brute-force script that runs each domain through the dictionary.
The problem is that I can't get it to produce good enough results. The simple script I wrote looks like this:
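$wd = [];
foreach ($domains as $dom) {
    foreach ($words as $w) {
        // stripos() returns false only when $w does not occur in $dom;
        // a match at offset 0 is still a match, hence the strict comparison
        if (stripos($dom, $w) !== false) {
            $wd[$dom][] = $w;
        }
    }
}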
$words is the dictionary array and $domains is just an array of domain names.
The results look like this:
[aheadsoftware] => Array
    (
        [0] => ahead
        [1] => head
        [2] => heads
        [3] => soft
        [4] => software
        [5] => ware
    )
Technically it works, but what I don't know how to code is the logic that makes the script understand that once it has matched 'ahead', 'head' and 'heads' are no longer candidates. It should also know to pick 'software' instead of 'soft' and 'ware'. Yes, I know, the world of linguistic computing is pure pain ;)
Comments (1)
A naive solution: every time you have a match, and before you add the word to the results, do another stristr lookup to see whether the word you are about to add is contained in any of the words already there. If it is, don't add it. This would not work if, for example, the domain contains 'heads' and your dictionary lists 'head' first; you would probably rather have 'heads' in the results than 'head'.
You can get around that limitation by checking which word is longer. If the word already in your results is longer, do not add the new word. If the new word is longer, remove the one already in the results and add the new one.
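A minimal sketch of the check described above, assuming the dictionary is a plain array of strings. The function name filterContainedWords is made up for illustration, and stripos is used in place of stristr because only a yes/no containment test is needed:

```php
<?php
// Hypothetical helper: collect dictionary words found in $domain, but never keep
// a word that is contained in another kept word (the longer word wins).
function filterContainedWords(string $domain, array $words): array
{
    $kept = [];
    foreach ($words as $w) {
        if ($w === '' || stripos($domain, $w) === false) {
            continue; // the word does not occur in the domain at all
        }
        $covered = false;
        foreach ($kept as $i => $k) {
            if (stripos($k, $w) !== false) {
                // a kept word of equal or greater length already contains the new word
                $covered = true;
                break;
            }
            if (stripos($w, $k) !== false) {
                // the new word is longer and contains a kept word: drop the shorter one
                unset($kept[$i]);
            }
        }
        if (!$covered) {
            $kept[] = $w;
        }
    }
    return array_values($kept); // reindex after any removals
}

print_r(filterContainedWords(
    'aheadsoftware',
    ['ahead', 'head', 'heads', 'soft', 'software', 'ware']
));
```

For the example in the question this keeps 'ahead', 'heads' and 'software'. Note that 'heads' survives: it overlaps 'ahead' inside the domain but is not contained in it, so a containment check between words alone does not remove it. Handling that case would need extra logic that tracks which positions of the domain each match covers.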