Efficient data structure/algorithm for transliteration-based word lookup
I'm looking for an efficient data structure/algorithm for storing and searching transliteration-based word lookups (like Google does: http://www.google.com/transliterate/, but I'm not trying to use the Google Transliteration API). Unfortunately, the natural language I'm working on doesn't have any Soundex implementation, so I'm on my own.
For an open-source project, I'm currently using plain arrays to store the word list and dynamically generating regular expressions (based on user input) to match against it. It works fine, but regular expressions are more powerful and resource-intensive than I need. For example, I'm afraid this solution will drain too much battery if I port it to handheld devices, since searching over thousands of words with a regular expression is too costly.
There must be a better way to accomplish this for complex languages. How does the Pinyin input method work, for example? Any suggestions on where to start?
Thanks in advance.
Edit: If I understand correctly, this is what @Dialecticus suggested:
I want to transliterate from Language1, which has 3 characters (a, b, c), to Language2, which has 6 characters (p, q, r, x, y, z). Because the two languages differ in the number of characters they possess and in their phones, it is often not possible to define a one-to-one mapping.
Let's assume that, phonetically, this is our associative array/transliteration table:
a -> p, q
b -> r
c -> x, y, z
We also have a list of valid words for Language2, stored in a plain array:
...
px
qy
...
If the user types ac, the possible combinations become px, py, pz, qx, qy, qz after transliteration step 1. In step 2 we have to do another search in the valid word list and eliminate every combination except px and qy.
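As a sketch (in Java, using the example table above; class and method names are illustrative), the two steps might look like this:

```java
import java.util.*;

public class CombinationLookup {
    // The example transliteration table: each Language1 symbol maps to its Language2 candidates.
    static final Map<Character, String> TABLE = Map.of('a', "pq", 'b', "r", 'c', "xyz");

    // Step 1: expand the input into every possible Language2 spelling.
    static List<String> combinations(String input) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (char c : input.toCharArray()) {
            List<String> next = new ArrayList<>();
            for (String prefix : result)
                for (char t : TABLE.get(c).toCharArray())
                    next.add(prefix + t);
            result = next;
        }
        return result;
    }

    // Step 2: keep only the combinations that appear in the valid word list.
    static List<String> lookup(String input, Set<String> validWords) {
        List<String> hits = new ArrayList<>();
        for (String s : combinations(input))
            if (validWords.contains(s)) hits.add(s);
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(combinations("ac"));                 // [px, py, pz, qx, qy, qz]
        System.out.println(lookup("ac", Set.of("px", "qy")));   // [px, qy]
    }
}
```

Note that step 1 is exponential in the input length, which is why the filtering in step 2 matters.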
What I'm doing currently is not that different from the above approach. Instead of generating the possible combinations from the transliteration table, I build a regular expression [pq][xyz] and match it against my valid word list, which yields the output px and qy.
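The regex variant can be sketched the same way (again in Java, with illustrative names; the table is the example one):

```java
import java.util.*;
import java.util.regex.Pattern;

public class RegexLookup {
    // The example transliteration table: each Language1 symbol maps to its Language2 candidates.
    static final Map<Character, String> TABLE = Map.of('a', "pq", 'b', "r", 'c', "xyz");

    // Build a character-class regex such as [pq][xyz] from the user's input,
    // then scan the whole valid word list with it.
    static List<String> lookup(String input, List<String> validWords) {
        StringBuilder re = new StringBuilder();
        for (char c : input.toCharArray()) {
            re.append('[').append(TABLE.get(c)).append(']');
        }
        Pattern p = Pattern.compile(re.toString());
        List<String> result = new ArrayList<>();
        for (String w : validWords) {
            if (p.matcher(w).matches()) result.add(w);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lookup("ac", List.of("px", "qy", "rz"))); // [px, qy]
    }
}
```

The cost concern is visible here: every lookup compiles a fresh pattern and runs it over the entire word list.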
I'm eager to know if there is any better method than that.
2 Answers
From what I understand, you have an input string S in an alphabet (let's call it A1) and you want to convert it to the string S', its equivalent in another alphabet A2. Actually, if I understand correctly, you want to generate a list [S'1, S'2, ..., S'n] of output strings which might potentially be equivalent to S.
One approach that comes to mind is, for each word in the list of valid words in A2, to generate the string in A1 that matches it. Using the example in your edit (I have added an extra valid word, pr, for clarity), we get:
px -> ac
qy -> ac
pr -> ab
Now that we know which possible series of input symbols will always map to a valid word, we can use our table to build a Trie.
Each node will hold a pointer to a list of valid words in A2 that map to the sequence of symbols in A1 that form the path from the root of the Trie to the current node.
Thus, for our example, the Trie would look something like this: the root has a single child a, which in turn has a child b holding the word list [pr] and a child c holding the word list [px, qy].
Starting at the root node, as symbols are consumed, transitions are made from the current node to the child marked with the consumed symbol, until the entire string has been read. If at any point no transition is defined for that symbol, the entered string does not exist in our trie and thus does not map to a valid word in the target language. Otherwise, at the end of the process, the list of words associated with the current node is the list of valid words the input string maps to.
Apart from the initial cost of building the trie (which can be shipped pre-built if we never want the list of valid words to change), this takes O(n) in the length of the input to find the list of matching valid words.
Using a Trie also provides the advantage that you can use it to find the list of all valid words that can be generated by adding more symbols to the end of the input, i.e. a prefix match. For example, if fed the input symbol 'a', we can use the trie to find all valid words that begin with 'a' ('px', 'qy', 'pr'). But doing that is not as fast as finding an exact match.
Here's a quick hack at a solution (in Java):
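A minimal sketch of such a trie in Java, assuming the three-symbol example above (class and method names are illustrative):

```java
import java.util.*;

public class TransliterationTrie {
    // Reverse transliteration table: each Language2 symbol maps back to one Language1 symbol.
    static final Map<Character, Character> REVERSE = Map.of(
            'p', 'a', 'q', 'a', 'r', 'b',
            'x', 'c', 'y', 'c', 'z', 'c');

    static class Node {
        Map<Character, Node> children = new HashMap<>();
        List<String> words = new ArrayList<>(); // valid words whose A1 spelling is the path to this node
    }

    final Node root = new Node();

    // Compute the Language1 spelling of a valid word and store the word at the end of that path.
    void insert(String validWord) {
        Node cur = root;
        for (char c : validWord.toCharArray()) {
            char a1 = REVERSE.get(c);
            cur = cur.children.computeIfAbsent(a1, k -> new Node());
        }
        cur.words.add(validWord);
    }

    // Walk the trie over the user's Language1 input; empty list means no valid word matches.
    List<String> lookup(String input) {
        Node cur = root;
        for (char c : input.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return Collections.emptyList();
        }
        return cur.words;
    }

    public static void main(String[] args) {
        TransliterationTrie trie = new TransliterationTrie();
        for (String w : new String[] {"px", "qy", "pr"}) trie.insert(w);
        System.out.println(trie.lookup("ac")); // [px, qy]
        System.out.println(trie.lookup("ab")); // [pr]
    }
}
```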
This is a very simple trie, with no compression or speedups, and it only works on lower-case English characters for the input language, but it can easily be modified for other character sets.
I would build the transliterated sentence one symbol at a time, instead of one word at a time. For most languages it is possible to transliterate each symbol independently of the other symbols in the word. You can still have exceptions, as whole words that have to be transliterated as complete words, but a transliteration table of symbols plus exceptions will surely be smaller than a transliteration table of all existing words.
The best structure for the transliteration table is some sort of associative array, probably utilizing hash tables. In C++ there's std::unordered_map, and in C# you would use Dictionary.
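In Java (the language used in the first answer), the same idea, a hash-backed per-symbol table with a separate whole-word exception table, might be sketched like this (both tables here are made-up examples):

```java
import java.util.*;

public class SymbolTransliterator {
    // Per-symbol table: one Language1 symbol -> its Language2 transliteration (illustrative).
    static final Map<Character, String> SYMBOLS = Map.of('a', "p", 'b', "r", 'c', "x");

    // Whole-word exceptions checked first (a hypothetical irregular word).
    static final Map<String, String> EXCEPTIONS = Map.of("abc", "qrz");

    static String transliterate(String word) {
        String exception = EXCEPTIONS.get(word);
        if (exception != null) return exception;
        StringBuilder out = new StringBuilder();
        for (char c : word.toCharArray()) {
            // Unknown symbols pass through unchanged.
            out.append(SYMBOLS.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("ab"));  // pr
        System.out.println(transliterate("abc")); // qrz (whole-word exception)
    }
}
```

Both tables together stay proportional to the alphabet size plus the number of irregular words, rather than to the whole vocabulary.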