Does anyone know of an example algorithm for word segmentation using dynamic programming?

Posted 2024-08-11 23:37:28

Comments (3)

最终幸福 2024-08-18 23:37:28

I'm going to assume that we're not talking about the trivial case here (i.e. not just splitting a string around spaces, since that'd just be a basic tokenizer problem) - instead, we're talking about something where there isn't a clear word-delimiter character, and thus we have to "guess" what the best match for string->words would be - for instance, the case of a set of concatenated words without spaces, such as transforming this:

lotsofwordstogether

into this:

lots, of, words, together

In this case, the dynamic programming approach would probably be to calculate a table where one dimension corresponds to the Mth word in the sequence, and the other dimension corresponds to each Nth character in the input string. The value that you fill in for each square of the table is then "the best match score we can get if we end (or, alternatively, begin) the Mth word at position N".
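To make that concrete, here is a minimal Python sketch of the idea. It uses a simpler one-dimensional form of the table (the best score for the first N characters, without tracking the word index M separately), and the WORDS set and word_score function are toy stand-ins of my own, not anything from the answer above:

WORDS = {"lots", "of", "words", "together", "lot", "so", "word"}

def word_score(word):
    # Toy scoring: favor longer dictionary words, reject non-words outright.
    return len(word) if word in WORDS else float("-inf")

def segment(text):
    n = len(text)
    # best[i] holds (best score, best word list) for the prefix text[:i].
    best = [(float("-inf"), [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            prev_score, prev_words = best[start]
            candidate = prev_score + word_score(text[start:end])
            if candidate > best[end][0]:
                best[end] = (candidate, prev_words + [text[start:end]])
    return best[n][1]

print(segment("lotsofwordstogether"))  # ['lots', 'of', 'words', 'together']

Each cell is filled from already-computed smaller prefixes, which is exactly the table-filling step described above; a richer scorer (e.g. corpus word frequencies) slots into word_score without changing the loops.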

蓝天白云 2024-08-18 23:37:28

The Python wordsegment module has such an algorithm. It uses recursion and memoization to implement dynamic programming.

The source is available on GitHub; here's the relevant snippet:

def segment(text):
    "Return a list of words that is the best segmenation of `text`."

    memo = dict()

    def search(text, prev='<s>'):
        if text == '':
            return 0.0, []

        def candidates():
            for prefix, suffix in divide(text):
                prefix_score = log10(score(prefix, prev))

                pair = (suffix, prefix)
                if pair not in memo:
                    memo[pair] = search(suffix, prefix)
                suffix_score, suffix_words = memo[pair]

                yield (prefix_score + suffix_score, [prefix] + suffix_words)

        return max(candidates())

    result_score, result_words = search(clean(text))

    return result_words

Note how memo caches calls to search, which in turn selects the max from candidates.
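The snippet also depends on a few helpers that aren't shown: clean, divide, score, and log10 (the last from Python's math module). Roughly, clean normalizes the input, divide yields every (prefix, suffix) split up to a maximum word length, and score converts corpus counts into a probability. Here is a simplified sketch of what they might look like - the constants, the tiny UNIGRAMS table, and the unknown-word penalty are illustrative stand-ins, not the module's exact code:

from math import log10

MAX_WORD_LENGTH = 24  # the real module caps candidate word length similarly
UNIGRAMS = {"lots": 312432.0, "of": 30966074232.0}  # tiny stand-in corpus counts
TOTAL = 1024908267229.0  # total token count of the underlying corpus

def clean(text):
    # Lowercase and keep only alphanumeric characters.
    return "".join(ch for ch in text.lower() if ch.isalnum())

def divide(text):
    # Yield (prefix, suffix) for every split point, capped at the max word length.
    for pos in range(1, min(len(text), MAX_WORD_LENGTH) + 1):
        yield text[:pos], text[pos:]

def score(word, prev=None):
    # Probability of `word`. `prev` is accepted but unused in this sketch;
    # the real module uses it for bigram scoring when the pair is known.
    if word in UNIGRAMS:
        return UNIGRAMS[word] / TOTAL
    # Unknown words are penalized more heavily the longer they are.
    return 10.0 / (TOTAL * 10 ** len(word))

With these in place, search explores every split of the remaining text, and the memo dictionary guarantees each (suffix, previous-word) pair is solved only once - which is what turns the exponential recursion into dynamic programming.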

岁月染过的梦 2024-08-18 23:37:28

Here is the solution in iterative style. (The main idea is to break the problem up into finding segmentations with exactly 1, 2, 3, ..., n segmented words within a given range of the input. Excuse me if there are any minor indexing mistakes - my head is very fuzzy these days - but this is an iterative solution for you.)

static List<String> connectIter(List<String> tokens) {

    // DP table in place of a 2-D array; the key is of the form
    // "wordCount startIndex", and the value holds every segmentation of
    // tokens[startIndex..n) into exactly wordCount dictionary words.
    Map<String, List<String>> data = new HashMap<String, List<String>>();

    int n = tokens.size();

    // Base case: the whole suffix tokens[i..n) forms a single dictionary word.
    for (int i = 0; i < n; i++) {
        String str = concat(tokens, i, n);
        if (dictContains(str)) { // dictContains: dictionary lookup, assumed given
            data.put(1 + " " + i, Collections.singletonList(str));
        }
    }

    // Build i-word segmentations from (i - 1)-word segmentations of the rest.
    for (int i = 2; i <= n; i++) {
        for (int start = 0; start <= n - i; start++) { // at least i tokens must remain
            List<String> solutions = new ArrayList<String>();
            for (int end = start + 1; end <= n - i + 1; end++) {
                String firstPart = concat(tokens, start, end);

                if (dictContains(firstPart)) {
                    // Extend each (i - 1)-word segmentation of the remaining suffix.
                    List<String> partialSolutions = data.get((i - 1) + " " + end);
                    if (partialSolutions != null) {
                        for (String onePartialSolution : partialSolutions) {
                            solutions.add(firstPart + " " + onePartialSolution);
                        }
                    }
                }
            }

            if (!solutions.isEmpty()) {
                data.put(i + " " + start, solutions);
            }
        }
    }

    // Collect every segmentation that covers the whole input (start index 0).
    List<String> ret = new ArrayList<String>();
    for (int i = 1; i <= n; i++) { // can be optimized to run with fewer iterations
        List<String> solutions = data.get(i + " " + 0);
        if (solutions != null) {
            ret.addAll(solutions);
        }
    }

    return ret;
}


private static String concat(List<String> tokens, int low, int hi) {
    StringBuilder sb = new StringBuilder();
    for (int i = low; i < hi; i++) {
        sb.append(tokens.get(i));
    }
    return sb.toString();
}
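For comparison, here is the same bottom-up table in a compact Python sketch; the WORDS set is a hypothetical stand-in for dictContains, and the keys are (word count, start index) tuples instead of strings:

WORDS = {"lots", "of", "words", "together"}  # stand-in for dictContains

def connect_iter(tokens):
    n = len(tokens)
    # (word_count, start) -> all segmentations of tokens[start:] into
    # exactly word_count dictionary words.
    data = {}

    # Base case: the whole suffix tokens[i:] as a single dictionary word.
    for i in range(n):
        word = "".join(tokens[i:])
        if word in WORDS:
            data[(1, i)] = [word]

    # Build i-word segmentations from (i - 1)-word segmentations of the rest.
    for i in range(2, n + 1):
        for start in range(n - i + 1):
            solutions = []
            for end in range(start + 1, n - i + 2):
                first = "".join(tokens[start:end])
                if first in WORDS:
                    for partial in data.get((i - 1, end), []):
                        solutions.append(first + " " + partial)
            if solutions:
                data[(i, start)] = solutions

    # Collect every segmentation that covers the whole input.
    return [s for i in range(1, n + 1) for s in data.get((i, 0), [])]

print(connect_iter(list("lotsofwordstogether")))  # ['lots of words together']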