快速高效的短语词典查找算法?
假设我有一本包含几百万个单词和短语的字典。对于每个输入句子,我想识别(完全匹配)字典包含的所有单词/短语。应优先选择最长的字典名称,并且不能重叠。 例如:
Sentence: "Los Angeles Lakers visited Washington State last week"
Dictionary: {Los Angeles, Lakers, Los Angeles Lakers, Washington, State, Washington State University}
Then the sentence would be tagged as follows:
[Los Angeles Lakers] visited [Washington] [State] last week.
我能想到的一种解决方案是将字典存储在内存中,查找时间恒定(例如,基于哈希的集合),然后从每个句子中提取所有单词 n 元组(n 可以设置为单词中的单词数)字典中最长的短语)将每个短语与字典进行比较并保留不重叠的最长短语。有更好的解决方案吗? (因为 n 元语法的生成可能很慢)。也许树木可以帮忙?
谢谢!
Let's say I have a dictionary with a few million words and phrases. For each input sentence I want to identify (exact match) all words/phrases that the dictionary contains. The longest dictionary name should be preferred and no overlaps.
For example:
Sentence: "Los Angeles Lakers visited Washington State last week"
Dictionary: {Los Angeles, Lakers, Los Angeles Lakers, Washington, State, Washington State University}
Then the sentence would be tagged as follows:
[Los Angeles Lakers] visited [Washington] [State] last week.
One solution I can think of is storing the dictionary in memory with constant look-up time (e.g., hash based set) and then extracting all word n-grams from each sentence (n can be set to the number of words in the longest phrase in the dictionary) comparing each against the dictionary and keeping the longest ones that don't overlap. Is there a better solution? (because the n-gram generation can be slow). Maybe trees can help?
Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您可能需要考虑类似 基数树 或 前缀树,使用整个单词作为构建的一部分。这些树对于字典类型的问题来说是很自然的。
然后简单地将事物分割成单词,并执行特里结构搜索。根据分组的预期长度,您可以从前面(不情愿)或从后面(贪婪)构建。
You may want to consider something like a radix tree or a prefix tree, using whole words as part of your buildup. These are the kinds of trees that are natural to dictionary type problems.
Then simply split things into words, and perform a search of the trie. Depending on the expected length of the groupings, you would build from the front (reluctant) or from the back (greedy).
您可能会查看DAWG(有向非循环字图)。您可以将整个短语存储为 DAWG 中的路径。然后,您将开始匹配句子并找到匹配的最长短语作为最长路径。然后,您将类似地继续处理其余不匹配的句子。空白区域需要一些特殊处理。
You might look at DAWG (Directed acyclic word graph). You would store the whole phrases as paths in DAWG. Then, you would start to match the sentence and find the longest phrase that matches as the longest path. You would then proceed with the rest unmatched sentence similarly. White space would need some special handling.
以下代码对句子执行非重叠标记,优先考虑较长的匹配。
它将句子视为字符串标记数组。
它使用“max_key_size”参数(字典中键的最大大小)来避免搜索永远不会发生的匹配。
在您的示例中,生成的句子是:
[[洛杉矶湖人队],访问过,[华盛顿],[州],上周,上周]
希望有帮助:
The following code performs non-overlapping tagging of a sentence giving priority to longer matches.
It treats a sentence as an array of string tokens.
It uses a "max_key_size" parameter (maximum size of a key in your dictionary) to avoid searching matches that will never happen.
In your example, the resulting sentence is:
[[Los Angeles Lakers], visited, [Washington], [State], last, week]
Hope it helps: