从没有空格/组合词的文本中检测最有可能的单词

发布于 2024-08-19 23:41:49 字数 136 浏览 7 评论 0原文

如何从组合字符串中检测和拆分单词?

例子:

"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]

How could I detect and split words from a combined string?

Example:

"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

背叛残局 2024-08-26 23:41:49

这是一个动态编程解决方案(作为记忆函数实现)。给定一个单词字典及其出现频率,它会将输入文本分割到给出总体最有可能的短语的位置。您必须找到一个真正的单词列表,但我包含了一些虚构的频率以进行简单的测试。

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in xrange(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder
    cache[text] = (best_freq, best_split)
    return cache[text]

print split_text('filesaveas', WORD_FREQUENCIES, {})

--> (1.3653e-08, ['file', 'save', 'as'])

Here's a dynamic programming solution (implemented as a memoized function). Given a dictionary of words with their frequencies, it splits the input text at the positions that give the overall most likely phrase. You'll have to find a real wordlist, but I included some made-up frequencies for a simple test.

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in xrange(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder
    cache[text] = (best_freq, best_split)
    return cache[text]

print split_text('filesaveas', WORD_FREQUENCIES, {})

--> (1.3653e-08, ['file', 'save', 'as'])
懷念過去 2024-08-26 23:41:49

我不知道有什么库,但实现基本功能应该不难。

  1. 获取单词列表,如 UNIX 的 words
  2. 将单词列表的内容填充到字典树中。
  3. 取出要拆分的字符串并遵循其在 trie 中的路径。每次到达有效单词时,创建一个新分支,从您到达的字符串点开始搜索单词。完成当前分支后,回溯到您创建的分支,就像深度优先搜索一样。
  4. 使用启发式方法或通过自然语言解析器手动消除结果列表的歧义。

示例:

  1. 单词:“filesaveasstring”
  2. 第一个有效单词是“file”。尝试匹配“saveas”。第一个有效词是“保存”。尝试匹配“asstring”。第一个有效单词是“as”。尝试匹配“字符串”。第一个有效单词是“string”。匹配直至结束;将 [文件另存为字符串] 放入结果列表中。
  3. 回溯到匹配的“字符串” - 没有其他可能性。回溯到“asstring”。第一个未访问的有效单词是“ass”。尝试匹配“tring”。没有可能的匹配。回溯到“asstring”。没有可能的匹配。返回到“filesaveasstring”。
  4. 第一个未访问的匹配项是“files”。尝试匹配“aveasstring”。第一个匹配是“ave”。尝试匹配“asstring”(与步骤 2/3 的结果相同),将 [files ave as string] 添加到结果列表中并回溯到开头。
  5. 尝试匹配“filesaveasstring”。没有未观看的比赛。完毕。
  6. 使用启发式或自然语言解析器从 [[文件另存为字符串] [文件另存为字符串]] 中选择最有可能的。

I don't know of any library for it, but it shouldn't be hard to implement basic functionality.

  1. Get a words list, like UNIX's words.
  2. Stuff the contents of your word list into a trie.
  3. Take the string you want to split and follow its path in the trie. Each time you reach a valid word, create a new branch that searches for a word starting from the point of the string that you got to. Once you finish your current branch, backtrack to the one you created, like in a depth first search.
  4. Disambiguate the resulting lists manually, using heuristics or through a natural language parser.

Example:

  1. Word: "filesaveasstring"
  2. First valid word is "file". Try matching "saveas". First valid word is "save". Try matching "asstring". First valid word is "as". Try matching "string". First valid word is "string". Matched until end; put the [file save as string] into your results list.
  3. Backtrack to matching "string" - no other possibilities. Backtrack to "asstring". First unvisited valid word is "ass". Try matching "tring". No possible matches. Backtrack to "asstring". No possible matches. Backtrack to "filesaveasstring".
  4. First unvisited match is "files". Try to match "aveasstring". First match is "ave". Try matching "asstring" (same results as steps 2/3), adding [files ave as string] to your results list and backtrack to the start.
  5. Try matching "filesaveasstring". No unvisited matches. Done.
  6. Select the most likely from [[file save as string] [files ave as string]] using a heuristic or a natural language parser.
梦中的蝴蝶 2024-08-26 23:41:49

我不知道有哪个库可以执行此操作,但是如果您有一个单词列表,那么编写起来并不太难:

wordList = file('words.txt','r').read().split()
words = set( s.lower() for s in wordList )

def splitString(s):
    found = []

    def rec(stringLeft, wordsSoFar):
        if not stringLeft:
            found.append(wordsSoFar)
        for pos in xrange(1, len(stringLeft)+1):
            if stringLeft[:pos] in words:
                rec(stringLeft[pos:], wordsSoFar + [stringLeft[:pos]])

    rec(s.lower(), [])
    return found

这将返回将字符串拆分为给定单词的所有可能方法。

例子:

>>> splitString('filesaveas')
[['file', 'save', 'as'], ['files', 'ave', 'as']]

I don't know a library that does this, but it's not too hard to write if you have a list of words:

wordList = file('words.txt','r').read().split()
words = set( s.lower() for s in wordList )

def splitString(s):
    found = []

    def rec(stringLeft, wordsSoFar):
        if not stringLeft:
            found.append(wordsSoFar)
        for pos in xrange(1, len(stringLeft)+1):
            if stringLeft[:pos] in words:
                rec(stringLeft[pos:], wordsSoFar + [stringLeft[:pos]])

    rec(s.lower(), [])
    return found

This will return all possible ways to split the string into the given words.

Example:

>>> splitString('filesaveas')
[['file', 'save', 'as'], ['files', 'ave', 'as']]
过度放纵 2024-08-26 23:41:49

可以看这个例子:但是它是用 scala 编写的。
当句子之间不包含空格时,这可以分割您想要的任何内容。

Nonspaced-Sentence-Tokenizer

Can see this example : But its written in scala.
This can split anything you want when the sentence contains no space in between.

Nonspaced-Sentence-Tokenizer

纸短情长 2024-08-26 23:41:49

我知道这个问题是针对 Python 的,但我需要一个 JavaScript 实现。离开之前的答案,我想我应该分享我的代码。看起来工作还不错。

function findWords(input){
    input = input.toLowerCase().replace(/\s/g, ""); //Strip whitespace

    var index = 0;
    var validWords = [];
    for (var len = input.length; len > 0; len--){ //Go backwards as to favor longer words
        var testWord = input.substr(index, len);
        var dictIndex = _dictionary.indexOf(testWord.replace(/[^a-z\']/g, "")); //Remove non-letters
        if (dictIndex != -1){
            validWords.push(testWord);
            if (len == input.length){
                break; //We are complete
            }
            var nextWords = findWords(input.substr(len, input.length - len)); //Recurse
            if (!nextWords.words.length){ //No further valid words
                validWords.pop();
            }
            validWords = validWords.concat(nextWords.words);
            if (nextWords.complete === true){
                break; //Cascade complete
            }
        }
    }
    return {
        complete:len > 0, //We broke which indicates completion
        words:validWords
    };
}

注意:“_dictionary”应该是按频率排序的单词数组。我正在使用古腾堡计划中的词汇表。

I know this question is marked for Python but I needed a JavaScript implementation. Going off of the previous answers I figured I'd share my code. Seems to work decently.

function findWords(input){
    input = input.toLowerCase().replace(/\s/g, ""); //Strip whitespace

    var index = 0;
    var validWords = [];
    for (var len = input.length; len > 0; len--){ //Go backwards as to favor longer words
        var testWord = input.substr(index, len);
        var dictIndex = _dictionary.indexOf(testWord.replace(/[^a-z\']/g, "")); //Remove non-letters
        if (dictIndex != -1){
            validWords.push(testWord);
            if (len == input.length){
                break; //We are complete
            }
            var nextWords = findWords(input.substr(len, input.length - len)); //Recurse
            if (!nextWords.words.length){ //No further valid words
                validWords.pop();
            }
            validWords = validWords.concat(nextWords.words);
            if (nextWords.complete === true){
                break; //Cascade complete
            }
        }
    }
    return {
        complete:len > 0, //We broke which indicates completion
        words:validWords
    };
}

Note: "_dictionary" is expected to be an array of words sorted by frequency. I am using a wordlist from Project Gutenberg.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文