Extracting words from domain names

Posted 2024-12-06 11:32:22

I have a bunch of domains I would like to split into words. I downloaded a word list from wordlist.sourceforge.net and started writing a brute-force script that runs each domain through the dictionary.

The problem is that I can't get it to produce good enough results. The simple script I wrote looks like this:

$wd = array();
foreach ($domains as $dom) {
    foreach ($words as $w) {
        // stripos() returns false when $w does not occur in $dom (case-insensitive)
        if (stripos($dom, $w) !== false) {
            $wd[$dom][] = $w;
        }
    }
}

$words is the dictionary array and $domains is just an array of domain names.

The results look like this:

[aheadsoftware] => Array
    (
        [0] => ahead
        [1] => head
        [2] => heads
        [3] => soft
        [4] => software
        [5] => ware
    )

Technically it works, but what I don't know how to code is the trick that makes the script understand that once you have matched 'ahead', 'head' and 'heads' are no longer available. It should also pick 'software' instead of 'soft' and 'ware'. Yes, I know, the world of linguistic computing is pure pain ;)
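
One possible way to get that behaviour is to stop collecting every substring match and instead look for a way to cover the whole domain with dictionary words, preferring the split that uses the fewest words. The sketch below does this with a small dynamic-programming pass. The function name segment_domain() and the "fewest words wins" rule are assumptions made for illustration, not something from the original script, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not part of the original script): split $domain into
// dictionary words, preferring the segmentation that uses the fewest words.
// Returns null when the domain cannot be fully covered by dictionary words.
function segment_domain(string $domain, array $words): ?array
{
    $domain = strtolower($domain);
    $n = strlen($domain);

    // Build a hash set for fast lookups and remember the longest word length.
    $dict = [];
    $maxLen = 0;
    foreach ($words as $w) {
        $w = strtolower($w);
        if ($w === '') {
            continue;
        }
        $dict[$w] = true;
        $maxLen = max($maxLen, strlen($w));
    }

    // $best[$i] holds the fewest-words split of the first $i characters,
    // or null when that prefix cannot be covered by dictionary words.
    $best = array_fill(0, $n + 1, null);
    $best[0] = [];

    for ($i = 1; $i <= $n; $i++) {
        for ($j = max(0, $i - $maxLen); $j < $i; $j++) {
            if ($best[$j] === null) {
                continue; // the prefix ending at $j is not splittable
            }
            $piece = substr($domain, $j, $i - $j);
            if (!isset($dict[$piece])) {
                continue; // the candidate piece is not a dictionary word
            }
            // Keep this split only if it uses fewer words than the current best.
            if ($best[$i] === null || count($best[$j]) + 1 < count($best[$i])) {
                $best[$i] = array_merge($best[$j], [$piece]);
            }
        }
    }

    return $best[$n];
}

$words = ['ahead', 'head', 'heads', 'soft', 'software', 'ware'];
print_r(segment_domain('aheadsoftware', $words));
```

With the six words from the output above, this returns 'ahead' and 'software': 'head' and 'heads' drop out because nothing in the dictionary covers the leftover letters around them, and 'software' beats 'soft' + 'ware' because it needs one word instead of two. With a full word list, "fewest words" is only a rough heuristic; weighting candidates by word frequency would probably give better splits.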
