Word Segmentation Algorithm

What is the algorithm - seemingly in use on domain parking pages - that takes a spaceless bunch of words (e.g. "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (e.g. "the carrot of curiosity")?

Comments (4)

甜味拾荒者 2024-08-06 07:57:56

Start with a basic Trie data structure representing your dictionary. As you iterate through the characters of the string, search your way through the trie with a set of pointers rather than a single pointer - the set is seeded with the root of the trie. For each letter, the whole set is advanced at once via the pointer indicated by the letter, and if a set element cannot be advanced by the letter, it is removed from the set. Whenever you reach a possible end-of-word, add a new root-of-trie to the set (keeping track of the list of words seen so far with that set element). Finally, once all characters have been processed, return the word list of an arbitrary set element that is back at the root of the trie. If there's more than one, that means the string could be broken up in multiple ways (such as "therapistforum", which can be parsed as ["therapist", "forum"] or ["the", "rapist", "forum"]), and it's undefined which we'll return.

Or, in a wacked up pseudocode (Java foreach, tuple indicated with parens, set indicated with braces, cons using head :: tail, [] is the empty list):

List<String> breakUp(String str, Trie root) {
    Set<(List<String>, Trie)> set = {([], root)};
    for (char c : str) {
        Set<(List<String>, Trie)> newSet = {};
        for (List<String> ls, Trie t : set) {
            Trie tNext = t.follow(c);
            if (tNext != null) {
                newSet.add((ls, tNext));
                if (tNext.isWord()) {
                    // a word may end here: record it and restart at the root
                    newSet.add((tNext.getWord() :: ls, root));
                }
            }
        }
        set = newSet;
    }
    // an element back at the root ended exactly on a word boundary;
    // consing built the list back-to-front, so reverse it
    for (List<String> ls, Trie t : set) {
        if (t == root) return reverse(ls);
    }
    return null;
}

Let me know if I need to clarify or I missed something...
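
For concreteness, here is a runnable Java version of the same idea. The minimal Trie (with insert/follow), the State record, and the use of a List of states instead of a set are my own stand-ins, not part of the original answer:

import java.util.*;

public class TrieSegmenter {
    // Minimal trie: children keyed by character; `word` is non-null
    // exactly when a dictionary word ends at this node.
    static class Trie {
        Map<Character, Trie> children = new HashMap<>();
        String word;

        void insert(String w) {
            Trie node = this;
            for (char c : w.toCharArray())
                node = node.children.computeIfAbsent(c, k -> new Trie());
            node.word = w;
        }

        Trie follow(char c) { return children.get(c); }
    }

    // One search state: words completed so far plus the current trie pointer.
    record State(List<String> words, Trie node) {}

    static List<String> breakUp(String str, Trie root) {
        List<State> states = new ArrayList<>();
        states.add(new State(List.of(), root));
        for (char c : str.toCharArray()) {
            List<State> next = new ArrayList<>();
            for (State s : states) {
                Trie t = s.node().follow(c);
                if (t == null) continue;           // this pointer dies
                next.add(new State(s.words(), t)); // keep extending the current word
                if (t.word != null) {              // a word may end here: restart at the root
                    List<String> ws = new ArrayList<>(s.words());
                    ws.add(t.word);
                    next.add(new State(ws, root));
                }
            }
            states = next;
        }
        // Any state back at the root consumed the string as whole words.
        for (State s : states)
            if (s.node() == root) return s.words();
        return null;
    }

    public static void main(String[] args) {
        Trie root = new Trie();
        for (String w : List.of("the", "carrot", "of", "curiosity")) root.insert(w);
        System.out.println(breakUp("thecarrotofcuriosity", root));
        // prints: [the, carrot, of, curiosity]
    }
}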

薯片软お妹 2024-08-06 07:57:56

I would imagine they take a dictionary word list like /usr/share/dict/words on your common or garden variety Unix system and try to find sets of word matches (starting from the left?) that result in the largest amount of original text being covered by a match. A simple breadth-first-search implementation would probably work fine, since it obviously doesn't have to run fast.
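
As a hedged illustration of that idea, here is a minimal breadth-first search in Java that looks for a complete cover of the string by dictionary words; the partial-cover scoring suggested above is left out, and the dictionary path and maxWordLen cap are assumptions of mine:

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class DictSplitter {
    // Breadth-first search over character positions: from each reachable
    // position, try every dictionary word that matches the text there.
    static List<String> split(String text, Set<String> dict, int maxWordLen) {
        Map<Integer, Integer> parent = new HashMap<>(); // position -> where we came from
        Deque<Integer> queue = new ArrayDeque<>();
        parent.put(0, -1);
        queue.add(0);
        while (!queue.isEmpty()) {
            int pos = queue.poll();
            int limit = Math.min(text.length(), pos + maxWordLen);
            for (int end = pos + 1; end <= limit; end++) {
                if (!dict.contains(text.substring(pos, end)) || parent.containsKey(end))
                    continue;
                parent.put(end, pos);
                if (end == text.length()) { // full cover: walk the parents back
                    LinkedList<String> words = new LinkedList<>();
                    for (int p = end; parent.get(p) != -1; p = parent.get(p))
                        words.addFirst(text.substring(parent.get(p), p));
                    return words;
                }
                queue.add(end);
            }
        }
        return null; // no segmentation into dictionary words found
    }

    public static void main(String[] args) throws IOException {
        // Word list location as suggested above; adjust for your system.
        Set<String> dict = new HashSet<>(Files.readAllLines(Path.of("/usr/share/dict/words")));
        System.out.println(split("thecarrotofcuriosity", dict, 24));
    }
}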

时常饿 2024-08-06 07:57:56

I'd imagine these sites do it something like this (see the sketch below):

  1. Get a list of words for your target language
  2. Remove "useless" words like "a", "the", ...
  3. Run through the list and check which words are substrings of the domain name
  4. Take the most common words from the remaining list (or the ones with the highest AdSense rating, ...)

Of course that leads to nonsense for expertsexchange, but what else would you expect there...
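
A minimal Java sketch of steps 1-3; the word list is a toy placeholder, and step 4's frequency or AdSense ranking is omitted since that data isn't available here:

import java.util.*;

public class SubstringMatcher {
    public static void main(String[] args) {
        String domain = "thecarrotofcuriosity";
        // Steps 1-2: a toy word list with "useless" stop words already removed.
        List<String> words = List.of("carrot", "curiosity", "car", "rot", "union");
        // Step 3: keep only the words that occur as substrings of the domain.
        List<String> hits = new ArrayList<>();
        for (String w : words)
            if (domain.contains(w)) hits.add(w);
        // Step 4 would rank these by frequency or ad value; here we just print.
        System.out.println(hits); // prints: [carrot, curiosity, car, rot]
    }
}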

夜无邪 2024-08-06 07:57:56

(disclaimer: I did not try this myself, so take it merely as food for experimentation. 4-grams are picked mostly out of the blue sky, just from my experience that 3-grams won't work all too well; 5-grams or more might work better, even though you will have to deal with a pretty large table.) It's also simplistic in the sense that it does not take into account the ending of the string - if it otherwise works for you, you'd probably need to think about fixing the endings.

This algorithm would run in a predictable time proportional to the length of the string that you are trying to split.

So, first: take a lot of human-readable text. For each text, supposing it is in a single string str, run the following algorithm (pseudocode-ish notation; assumes [] is hashtable-like indexing and that nonexistent indexes return 0):

for (i = 0; i < length(str) - 4; i++) {
  // take the 4-character window starting at position i
  subs2 = substring(str, i, 4);
  if (has_space(subs2)) {
    // widen to 5 characters so that deleting the space leaves a 4-gram
    subs = substring(str, i, 5);
    delete_space(subs);
    yes_space[subs][position(space, subs2)]++;
  } else {
    subs = subs2;
    no_space[subs]++;
  }
}

This builds the tables that help decide whether a given 4-gram needs a space inserted into it.

Then take the string you want to split - I'll denote it xstr - and do:

for (i = 0; i < length(xstr) - 3; i++) {
  subs = substring(xstr, i, 4);
  for (j = 0; j < 4; j++) {
    // each window votes against a space at the positions it covers...
    do_insert_space_here[i + j] -= no_space[subs];
    // ...and for a space wherever this 4-gram was seen split in training
    do_insert_space_here[i + j] += yes_space[subs][j];
  }
}

Then you can walk the "do_insert_space_here[]" array - if the element at a given position is greater than zero, insert a space at that position in the original string; if it's less than zero, don't.
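
For experimentation, here is a runnable Java transcription of the two loops above; the class and method names are mine, and the one-sentence training corpus in main is a toy stand-in for a real pile of human-readable text:

import java.util.*;

public class NgramSplitter {
    static final int N = 4; // 4-grams, as suggested above

    // Counts of 4-grams seen with no internal space.
    static Map<String, Integer> noSpace = new HashMap<>();
    // Counts of 4-grams seen with a space, indexed by the space's window position.
    static Map<String, int[]> yesSpace = new HashMap<>();

    static void train(String str) {
        // Loop bound sized for the widened 5-character case below.
        for (int i = 0; i + N + 1 <= str.length(); i++) {
            String window = str.substring(i, i + N);
            int spaceAt = window.indexOf(' ');
            if (spaceAt >= 0) {
                // Widen to 5 characters so deleting the space leaves a 4-gram.
                String gram = str.substring(i, i + N + 1).replace(" ", "");
                if (gram.length() != N) continue; // more than one space: skip
                yesSpace.computeIfAbsent(gram, k -> new int[N])[spaceAt]++;
            } else {
                noSpace.merge(window, 1, Integer::sum);
            }
        }
    }

    static String split(String xstr) {
        int[] score = new int[xstr.length()]; // the do_insert_space_here[] array
        for (int i = 0; i + N <= xstr.length(); i++) {
            String gram = xstr.substring(i, i + N);
            int against = noSpace.getOrDefault(gram, 0);
            int[] votes = yesSpace.getOrDefault(gram, new int[N]);
            for (int j = 0; j < N; j++)
                score[i + j] += votes[j] - against;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < xstr.length(); i++) {
            if (score[i] > 0) out.append(' '); // positive score: a space goes here
            out.append(xstr.charAt(i));
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Toy stand-in for the "lot of human-readable texts".
        train("the carrot of curiosity is the carrot of discovery");
        System.out.println(split("thecarrotofcuriosity"));
        // prints: the carrot of curiosity
    }
}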

Please drop a note here if you try it (or something of this sort) and it works (or does not work) for you :-)
