soundex算法的数据结构?
谁能建议我在 soundex 算法 程序中使用什么数据结构? 使用的语言是Java。 如果有人以前用 Java 做过这个工作。 该程序应具有以下功能: 能够阅读约50,000字 应该能够读取一个单词并返回具有相同 soundex 的相关单词
我不希望程序实现只是关于使用什么数据结构的一些建议。
Can anyone suggest me on what data structure to use for a soundex algorithm program? The language to be used is Java. If anybody has worked on this before in Java. The program should have these features:
be able to read about 50,000 words
should be able to read a word and return the related words having the same soundex
I don't want the program implementation just few advices on what data structure to use.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
提示:如果您使用 SQL 作为数据后端,那么您可以让 SQL 使用两个 sql 函数 SOUNDEX 和 DIFFERENCE 来处理它。
也许不是你想要的,但很多人不知道MSsql有这两个功能。
TIP: If you use SQL as a databackend then you can let SQL handle it with the two sql-functions SOUNDEX and DIFFERENCE.
Maybe not what you wanted, but many people do not know that MSsql has those two functions.
soundex 可以通过直接传递字符串来实现,因此不需要任何特殊的东西。
之后,4 个字符的代码可以被视为一个整数密钥。
然后构建一个字典来存储由该整数键索引的单词集。 50,000 个单词应该很容易就能记住,所以不需要任何花哨的东西。
然后遍历字典,每个桶都是一组发音相似的单词。
实际上,这是 perl 中的整个程序:
Well soundex can be implemented in a straightforward pass over a string, so that doesn't require anything special.
After that the 4 character code can be treated as an integer key.
Then just build a dictionary that stores word sets indexed by that integer key. 50,000 words should easily fit into memory so nothing fancy is required.
Then walk the dictionary and each bucket is a group of similar sounding words.
Actually, here is the whole program in perl:
我相信你只需要将原始字符串转换为 soundex 键并转换为哈希表即可; 表中每个条目的值将是映射到该 soundex 的原始字符串的集合。
Google 集合中的 MultiMap 集合接口(及其实现)会对您有用。
I believe you just need to convert the original strings into soundex keys into a hashtable; the value for each entry in the table would be a collection of original strings mapping to that soundex.
The MultiMap collection interface (and its implementations) in Google Collections would be useful to you.
哈希策略可以是 Soundex、Metaphone 或您拥有的任何策略。 有些策略可能是可调的(输出多少个字符等)
The hash strategy could be Soundex, Metaphone, or what have you. Some strategies might be tunable (how many characters does it output, etc.)
由于 soundex 是一个哈希,因此我将使用哈希表,以 soundex 作为键。
Since soundex is a hash, I'd use a hash table, with the soundex as the key.
你想要一个 4 字节整数。
soundex 算法始终返回 4 个字符的代码,如果您使用 ANSI 输入,您将得到 4 个字节(表示为 4 个字母)。
因此,将返回的代码存储在哈希表中,将单词转换为代码并在哈希表中查找。 真的就是这么简单。
you want a 4-byte integer.
The soundex algorithm always returns a 4-character code, if you use ANSI inputs, you'll get 4-bytes back (represented as 4 letters).
So store the codes returned in a hashtable, convert your word to the code and look it up in the hashtable. Its really that easy.