从没有空格/组合词的文本中检测最有可能的单词

发布于 2024-08-19 23:41:49 字数 136 浏览 8 评论 0原文

如何从组合字符串中检测和拆分单词？

例子：

"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]

原文

How could I detect and split words from a combined string?

Example:

"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

背叛残局 2024-08-26 23:41:49

这是一个动态编程解决方案（作为记忆函数实现）。给定一个单词字典及其出现频率，它会将输入文本分割到给出总体最有可能的短语的位置。您必须找到一个真正的单词列表，但我包含了一些虚构的频率以进行简单的测试。

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in xrange(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder
    cache[text] = (best_freq, best_split)
    return cache[text]

print split_text('filesaveas', WORD_FREQUENCIES, {})

--> (1.3653e-08, ['file', 'save', 'as'])

Here's a dynamic programming solution (implemented as a memoized function). Given a dictionary of words with their frequencies, it splits the input text at the positions that give the overall most likely phrase. You'll have to find a real wordlist, but I included some made-up frequencies for a simple test.

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in xrange(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder
    cache[text] = (best_freq, best_split)
    return cache[text]

print split_text('filesaveas', WORD_FREQUENCIES, {})

--> (1.3653e-08, ['file', 'save', 'as'])

回复收藏 0 原文

懷念過去 2024-08-26 23:41:49

我不知道有什么库，但实现基本功能应该不难。

获取单词列表，如 UNIX 的 words。
将单词列表的内容填充到字典树中。
取出要拆分的字符串并遵循其在 trie 中的路径。每次到达有效单词时，创建一个新分支，从您到达的字符串点开始搜索单词。完成当前分支后，回溯到您创建的分支，就像深度优先搜索一样。
使用启发式方法或通过自然语言解析器手动消除结果列表的歧义。

示例：

单词：“filesaveasstring”
第一个有效单词是“file”。尝试匹配“saveas”。第一个有效词是“保存”。尝试匹配“asstring”。第一个有效单词是“as”。尝试匹配“字符串”。第一个有效单词是“string”。匹配直至结束；将 [文件另存为字符串] 放入结果列表中。
回溯到匹配的“字符串” - 没有其他可能性。回溯到“asstring”。第一个未访问的有效单词是“ass”。尝试匹配“tring”。没有可能的匹配。回溯到“asstring”。没有可能的匹配。返回到“filesaveasstring”。
第一个未访问的匹配项是“files”。尝试匹配“aveasstring”。第一个匹配是“ave”。尝试匹配“asstring”（与步骤 2/3 的结果相同），将 [files ave as string] 添加到结果列表中并回溯到开头。
尝试匹配“filesaveasstring”。没有未观看的比赛。完毕。
使用启发式或自然语言解析器从 [[文件另存为字符串] [文件另存为字符串]] 中选择最有可能的。

回复收藏 0 原文

梦中的蝴蝶 2024-08-26 23:41:49

我不知道有哪个库可以执行此操作，但是如果您有一个单词列表，那么编写起来并不太难：

wordList = file('words.txt','r').read().split()
words = set( s.lower() for s in wordList )

def splitString(s):
    found = []

    def rec(stringLeft, wordsSoFar):
        if not stringLeft:
            found.append(wordsSoFar)
        for pos in xrange(1, len(stringLeft)+1):
            if stringLeft[:pos] in words:
                rec(stringLeft[pos:], wordsSoFar + [stringLeft[:pos]])

    rec(s.lower(), [])
    return found

这将返回将字符串拆分为给定单词的所有可能方法。

例子：

>>> splitString('filesaveas')
[['file', 'save', 'as'], ['files', 'ave', 'as']]

I don't know a library that does this, but it's not too hard to write if you have a list of words:

wordList = file('words.txt','r').read().split()
words = set( s.lower() for s in wordList )

def splitString(s):
    found = []

    def rec(stringLeft, wordsSoFar):
        if not stringLeft:
            found.append(wordsSoFar)
        for pos in xrange(1, len(stringLeft)+1):
            if stringLeft[:pos] in words:
                rec(stringLeft[pos:], wordsSoFar + [stringLeft[:pos]])

    rec(s.lower(), [])
    return found

This will return all possible ways to split the string into the given words.

Example:

>>> splitString('filesaveas')
[['file', 'save', 'as'], ['files', 'ave', 'as']]

回复收藏 0 原文

过度放纵 2024-08-26 23:41:49

可以看这个例子：但是它是用 scala 编写的。
当句子之间不包含空格时，这可以分割您想要的任何内容。

Nonspaced-Sentence-Tokenizer

回复收藏 0 原文

纸短情长 2024-08-26 23:41:49

我知道这个问题是针对 Python 的，但我需要一个 JavaScript 实现。离开之前的答案，我想我应该分享我的代码。看起来工作还不错。

function findWords(input){
    input = input.toLowerCase().replace(/\s/g, ""); //Strip whitespace

    var index = 0;
    var validWords = [];
    for (var len = input.length; len > 0; len--){ //Go backwards as to favor longer words
        var testWord = input.substr(index, len);
        var dictIndex = _dictionary.indexOf(testWord.replace(/[^a-z\']/g, "")); //Remove non-letters
        if (dictIndex != -1){
            validWords.push(testWord);
            if (len == input.length){
                break; //We are complete
            }
            var nextWords = findWords(input.substr(len, input.length - len)); //Recurse
            if (!nextWords.words.length){ //No further valid words
                validWords.pop();
            }
            validWords = validWords.concat(nextWords.words);
            if (nextWords.complete === true){
                break; //Cascade complete
            }
        }
    }
    return {
        complete:len > 0, //We broke which indicates completion
        words:validWords
    };
}

注意：“_dictionary”应该是按频率排序的单词数组。我正在使用古腾堡计划中的词汇表。

I know this question is marked for Python but I needed a JavaScript implementation. Going off of the previous answers I figured I'd share my code. Seems to work decently.

function findWords(input){
    input = input.toLowerCase().replace(/\s/g, ""); //Strip whitespace

    var index = 0;
    var validWords = [];
    for (var len = input.length; len > 0; len--){ //Go backwards as to favor longer words
        var testWord = input.substr(index, len);
        var dictIndex = _dictionary.indexOf(testWord.replace(/[^a-z\']/g, "")); //Remove non-letters
        if (dictIndex != -1){
            validWords.push(testWord);
            if (len == input.length){
                break; //We are complete
            }
            var nextWords = findWords(input.substr(len, input.length - len)); //Recurse
            if (!nextWords.words.length){ //No further valid words
                validWords.pop();
            }
            validWords = validWords.concat(nextWords.words);
            if (nextWords.complete === true){
                break; //Cascade complete
            }
        }
    }
    return {
        complete:len > 0, //We broke which indicates completion
        words:validWords
    };
}

Note: "_dictionary" is expected to be an array of words sorted by frequency. I am using a wordlist from Project Gutenberg.

回复收藏 0 原文

~没有更多了~