Fast and efficient phrase dictionary lookup algorithm?

Published 2024-12-03 16:32:00

Let's say I have a dictionary with a few million words and phrases. For each input sentence I want to identify (exact match) all words/phrases that the dictionary contains. The longest dictionary entry should be preferred, and matches must not overlap.
For example:

Sentence: "Los Angeles Lakers visited Washington State last week"
Dictionary: {Los Angeles, Lakers, Los Angeles Lakers, Washington, State, Washington State University}

Then the sentence would be tagged as follows:
[Los Angeles Lakers] visited [Washington] [State] last week. 

One solution I can think of is storing the dictionary in memory with constant look-up time (e.g., a hash-based set), then extracting all word n-grams from each sentence (with n set to the number of words in the longest phrase in the dictionary), comparing each against the dictionary, and keeping the longest ones that don't overlap. Is there a better solution? (The n-gram generation can be slow.) Maybe trees can help?

Thanks!


Comments (3)

凉城已无爱 2024-12-10 16:32:01

You may want to consider something like a radix tree or a prefix tree (trie), using whole words as the units of your build-up. These trees are a natural fit for dictionary-type problems.

Then simply split the sentence into words and search the trie. Depending on the expected length of the groupings, you would match from the front (reluctant) or from the back (greedy).
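A minimal word-level trie along these lines might look like the sketch below (class and method names are illustrative, not from the answer). Each trie edge is a whole word rather than a character; from each sentence position we walk the trie as far as it goes and keep the last complete phrase seen, which gives greedy longest-match tagging:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordTrieTagger {

    // One trie node per word; "terminal" marks the end of a dictionary phrase.
    static class Node {
        Map<String, Node> children = new HashMap<>();
        boolean terminal = false;
    }

    private final Node root = new Node();

    // Insert a phrase word by word.
    public void add(String phrase) {
        Node cur = root;
        for (String word : phrase.split(" ")) {
            cur = cur.children.computeIfAbsent(word, k -> new Node());
        }
        cur.terminal = true;
    }

    // Greedy longest-match tagging: from each position, follow the trie as far
    // as it matches and remember the last terminal node seen.
    public List<String> tag(String[] tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            Node cur = root;
            int lastMatch = -1; // index of the last token of the longest phrase
            for (int j = i; j < tokens.length; j++) {
                cur = cur.children.get(tokens[j]);
                if (cur == null) break;       // no phrase continues this way
                if (cur.terminal) lastMatch = j;
            }
            if (lastMatch >= 0) {
                String phrase = String.join(" ",
                        Arrays.copyOfRange(tokens, i, lastMatch + 1));
                out.add("[" + phrase + "]");
                i = lastMatch + 1;            // consume the matched tokens
            } else {
                out.add(tokens[i]);           // untagged token, keep as-is
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        WordTrieTagger t = new WordTrieTagger();
        for (String p : new String[]{"Los Angeles", "Lakers", "Los Angeles Lakers",
                "Washington", "State", "Washington State University"}) {
            t.add(p);
        }
        String[] sentence = {"Los", "Angeles", "Lakers", "visited",
                "Washington", "State", "last", "week"};
        System.out.println(t.tag(sentence));
        // prints: [[Los Angeles Lakers], visited, [Washington], [State], last, week]
    }
}
```

Unlike the n-gram approach, each failed walk stops as soon as no dictionary phrase can continue, so the per-position cost is bounded by the trie depth actually reachable, not by the longest phrase in the dictionary.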

故人爱我别走 2024-12-10 16:32:01

You might look at a DAWG (directed acyclic word graph). You would store the whole phrases as paths in the DAWG. Then you would start matching against the sentence, taking the longest matching phrase as the longest path, and proceed similarly with the rest of the unmatched sentence. Whitespace would need some special handling.

羁客 2024-12-10 16:32:01

The following code performs non-overlapping tagging of a sentence, giving priority to longer matches.

It treats the sentence as an array of string tokens.

It uses a "max_key_size" parameter (the maximum number of words in a dictionary key) to avoid searching for matches that can never occur.
For your example, the resulting sentence is:

[[Los Angeles Lakers], visited, [Washington], [State], last, week]

Hope it helps:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;

public class NonOverlappingTagging {

    private static ArrayList<String> non_overlapping_tagging(String[] sentence,
            HashMap<String, Integer> dict, int max_key_size) {
        ArrayList<String> tag_sentence = new ArrayList<>();
        int N = sentence.length;

        // -1 means "no limit": allow phrases as long as the sentence itself.
        if (max_key_size == -1) {
            max_key_size = N;
        }

        int i = 0;

        while (i < N) {
            boolean tagged = false;
            int j = Math.min(i + max_key_size, N); // avoid running past the end

            // Try the longest candidate window first, then shrink it.
            while (j > i) {
                String literal = String.join(" ", Arrays.copyOfRange(sentence, i, j));

                if (dict.containsKey(literal)) {
                    tag_sentence.add("[" + literal + "]");
                    i = j;          // consume the matched tokens
                    tagged = true;
                    break;
                }
                j -= 1;
            }

            if (!tagged) {
                tag_sentence.add(sentence[i]); // no match: keep the token as-is
                i += 1;
            }
        }

        return tag_sentence;
    }

    public static void main(String[] args) {
        String[] sentence = {"Los", "Angeles", "Lakers", "visited", "Washington", "State", "last", "week"};
        HashMap<String, Integer> dict = new HashMap<>();
        dict.put("Los Angeles", 1);
        dict.put("Lakers", 1);
        dict.put("Los Angeles Lakers", 1);
        dict.put("Washington", 1);
        dict.put("State", 1);
        dict.put("Washington State University", 1);

        ArrayList<String> tagging = non_overlapping_tagging(sentence, dict, -1);

        System.out.println(tagging);
    }
}