快速高效的短语词典查找算法？

发布于 2024-12-03 16:32:00 字数 556 浏览 0 评论 0原文

假设我有一本包含几百万个单词和短语的字典。对于每个输入句子，我想识别（完全匹配）字典包含的所有单词/短语。应优先选择最长的字典名称，并且不能重叠。例如：

Sentence: "Los Angeles Lakers visited Washington State last week"
Dictionary: {Los Angeles, Lakers, Los Angeles Lakers, Washington, State, Washington State University}

Then the sentence would be tagged as follows:
[Los Angeles Lakers] visited [Washington] [State] last week.

我能想到的一种解决方案是将字典存储在内存中，查找时间恒定（例如，基于哈希的集合），然后从每个句子中提取所有单词 n 元组（n 可以设置为单词中的单词数）字典中最长的短语）将每个短语与字典进行比较并保留不重叠的最长短语。有更好的解决方案吗？（因为 n 元语法的生成可能很慢）。也许树木可以帮忙？

谢谢！

原文

Let's say I have a dictionary with a few million words and phrases. For each input sentence I want to identify (exact match) all words/phrases that the dictionary contains. The longest dictionary name should be preferred and no overlaps.
For example:

Sentence: "Los Angeles Lakers visited Washington State last week"
Dictionary: {Los Angeles, Lakers, Los Angeles Lakers, Washington, State, Washington State University}

Then the sentence would be tagged as follows:
[Los Angeles Lakers] visited [Washington] [State] last week.

One solution I can think of is storing the dictionary in memory with constant look-up time (e.g., hash based set) and then extracting all word n-grams from each sentence (n can be set to the number of words in the longest phrase in the dictionary) comparing each against the dictionary and keeping the longest ones that don't overlap. Is there a better solution? (because the n-gram generation can be slow). Maybe trees can help?

Thanks!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

凉城已无爱 2024-12-10 16:32:01

您可能需要考虑类似基数树或前缀树，使用整个单词作为构建的一部分。这些树对于字典类型的问题来说是很自然的。

然后简单地将事物分割成单词，并执行特里结构搜索。根据分组的预期长度，您可以从前面（不情愿）或从后面（贪婪）构建。

回复收藏 0 原文

故人爱我别走 2024-12-10 16:32:01

您可能会查看DAWG（有向非循环字图）。您可以将整个短语存储为 DAWG 中的路径。然后，您将开始匹配句子并找到匹配的最长短语作为最长路径。然后，您将类似地继续处理其余不匹配的句子。空白区域需要一些特殊处理。

回复收藏 0 原文

羁客 2024-12-10 16:32:01

以下代码对句子执行非重叠标记，优先考虑较长的匹配。

它将句子视为字符串标记数组。

它使用“max_key_size”参数（字典中键的最大大小）来避免搜索永远不会发生的匹配。
在您的示例中，生成的句子是：

[[洛杉矶湖人队]，访问过，[华盛顿]，[州]，上周，上周]

希望有帮助：

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;

public class NonOverlappingTagging {

    private static ArrayList non_overlapping_tagging(String[] sentence, HashMap dict,int max_key_size) {
        ArrayList tag_sentence = new ArrayList();
        int N = sentence.length;

        if (max_key_size == -1) {
            max_key_size = N;
        }

        int i = 0;

        while (i < N) {

            boolean tagged = false;
            int j = Math.min(i + max_key_size, N); //avoid overflow

            while (j > i) {
                String[] literal_tokens = Arrays.copyOfRange(sentence, i, j);
                String literal = join(literal_tokens, " ");
                System.out.println(literal);

                if (dict.get(literal) != null) {
                    tag_sentence.add("["+literal+"]");
                    i = j;
                    tagged = true;
                }
                else {
                    j -= 1;
                }

            }

            if (!tagged) {
                tag_sentence.add(sentence[i]);
                i += 1;
            }
        }

        return tag_sentence;
    }

    private static String join(String[] sentence, String separator) {
        String result = "";
        for (int i = 0; i < sentence.length; i++) {
            String word = sentence[i];
            result += word + separator;
        }

        return result.trim();
    }

    public static void main(String[] args) {
        String[] sentence = {"Los", "Angeles", "Lakers", "visited", "Washington", "State", "last", "week"};
        HashMap <String, Integer>dict = new HashMap();
        dict.put("Los Angeles", 1);
        dict.put("Lakers", 1);
        dict.put("Los Angeles Lakers", 1);
        dict.put("Washington", 1);
        dict.put("State", 1);
        dict.put("Washington State University", 1);

        ArrayList tagging = non_overlapping_tagging(sentence, dict, -1);

        System.out.println(tagging);    
    }   
}

The following code performs non-overlapping tagging of a sentence giving priority to longer matches.

It treats a sentence as an array of string tokens.

It uses a "max_key_size" parameter (maximum size of a key in your dictionary) to avoid searching matches that will never happen.
In your example, the resulting sentence is:

[[Los Angeles Lakers], visited, [Washington], [State], last, week]

Hope it helps:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;

public class NonOverlappingTagging {

    private static ArrayList non_overlapping_tagging(String[] sentence, HashMap dict,int max_key_size) {
        ArrayList tag_sentence = new ArrayList();
        int N = sentence.length;

        if (max_key_size == -1) {
            max_key_size = N;
        }

        int i = 0;

        while (i < N) {

            boolean tagged = false;
            int j = Math.min(i + max_key_size, N); //avoid overflow

            while (j > i) {
                String[] literal_tokens = Arrays.copyOfRange(sentence, i, j);
                String literal = join(literal_tokens, " ");
                System.out.println(literal);

                if (dict.get(literal) != null) {
                    tag_sentence.add("["+literal+"]");
                    i = j;
                    tagged = true;
                }
                else {
                    j -= 1;
                }

            }

            if (!tagged) {
                tag_sentence.add(sentence[i]);
                i += 1;
            }
        }

        return tag_sentence;
    }

    private static String join(String[] sentence, String separator) {
        String result = "";
        for (int i = 0; i < sentence.length; i++) {
            String word = sentence[i];
            result += word + separator;
        }

        return result.trim();
    }

    public static void main(String[] args) {
        String[] sentence = {"Los", "Angeles", "Lakers", "visited", "Washington", "State", "last", "week"};
        HashMap <String, Integer>dict = new HashMap();
        dict.put("Los Angeles", 1);
        dict.put("Lakers", 1);
        dict.put("Los Angeles Lakers", 1);
        dict.put("Washington", 1);
        dict.put("State", 1);
        dict.put("Washington State University", 1);

        ArrayList tagging = non_overlapping_tagging(sentence, dict, -1);

        System.out.println(tagging);    
    }   
}

回复收藏 0 原文

~没有更多了~