soundex算法的数据结构?

发布于 2024-07-08 16:25:16 字数 253 浏览 8 评论 0原文

谁能建议我在 soundex 算法 程序中使用什么数据结构? 使用的语言是Java。 如果有人以前用 Java 做过这个工作。 该程序应具有以下功能: 能够阅读约50,000字 应该能够读取一个单词并返回具有相同 soundex 的相关单词

我不希望程序实现只是关于使用什么数据结构的一些建议。

Can anyone suggest me on what data structure to use for a soundex algorithm program? The language to be used is Java. If anybody has worked on this before in Java. The program should have these features:
be able to read about 50,000 words
should be able to read a word and return the related words having the same soundex

I don't want the program implementation just few advices on what data structure to use.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

我爱人 2024-07-15 16:25:16

提示:如果您使用 SQL 作为数据后端,那么您可以让 SQL 使用两个 sql 函数 SOUNDEX 和 DIFFERENCE 来处理它。

也许不是你想要的,但很多人不知道MSsql有这两个功能。

TIP: If you use SQL as a databackend then you can let SQL handle it with the two sql-functions SOUNDEX and DIFFERENCE.

Maybe not what you wanted, but many people do not know that MSsql has those two functions.

愿得七秒忆 2024-07-15 16:25:16

soundex 可以通过直接传递字符串来实现,因此不需要任何特殊的东西。

之后,4 个字符的代码可以被视为一个整数密钥。

然后构建一个字典来存储由该整数键索引的单词集。 50,000 个单词应该很容易就能记住,所以不需要任何花哨的东西。

然后遍历字典,每个桶都是一组发音相似的单词。

实际上,这是 perl 中的整个程序:

#!/usr/bin/perl
use Text::Soundex;
use Data::Dumper;
open(DICT,"</usr/share/dict/linux.words");
my %dictionary = ();
while (<DICT>) {
        chomp();
        chomp();
        push @{$dictionary{soundex($_)}},$_;
}
close(DICT);
while (<>) {
        my @words = split / +/;
        foreach (@words) {
            print Dumper $dictionary{soundex($_)};
        }
}

Well soundex can be implemented in a straightforward pass over a string, so that doesn't require anything special.

After that the 4 character code can be treated as an integer key.

Then just build a dictionary that stores word sets indexed by that integer key. 50,000 words should easily fit into memory so nothing fancy is required.

Then walk the dictionary and each bucket is a group of similar sounding words.

Actually, here is the whole program in perl:

#!/usr/bin/perl
use Text::Soundex;
use Data::Dumper;
open(DICT,"</usr/share/dict/linux.words");
my %dictionary = ();
while (<DICT>) {
        chomp();
        chomp();
        push @{$dictionary{soundex($_)}},$_;
}
close(DICT);
while (<>) {
        my @words = split / +/;
        foreach (@words) {
            print Dumper $dictionary{soundex($_)};
        }
}
烟酉 2024-07-15 16:25:16

我相信你只需要将原始字符串转换为 soundex 键并转换为哈希表即可; 表中每个条目的值将是映射到该 soundex 的原始字符串的集合。

Google 集合中的 MultiMap 集合接口(及其实现)会对您有用。

I believe you just need to convert the original strings into soundex keys into a hashtable; the value for each entry in the table would be a collection of original strings mapping to that soundex.

The MultiMap collection interface (and its implementations) in Google Collections would be useful to you.

遥远的绿洲 2024-07-15 16:25:16
class SpellChecker
{

  interface Hash {
    String hash(String);
  }

  private final Hash hash;

  private final Map<String, Set<String>> collisions;

  SpellChecker(Hash hash) {
    this.hash = hash;
    collisions = new TreeSet<String, Set<String>>();
  }

  boolean addWord(String word) {
    String key = hash.hash(word);
    Set<String> similar = collisions.get(key);
    if (similar == null)
      collisions.put(key, similar = new TreeSet<String>());
    return similar.add(word);
  }

  Set<String> similar(String word) {
    Set<String> similar = collisions.get(hash.hash(word));
    if (similar == null)
      return Collections.emptySet();
    else
      return Collections.unmodifiableSet(similar);
  }

}

哈希策略可以是 Soundex、Metaphone 或您拥有的任何策略。 有些策略可能是可调的(输出多少个字符等)

class SpellChecker
{

  interface Hash {
    String hash(String);
  }

  private final Hash hash;

  private final Map<String, Set<String>> collisions;

  SpellChecker(Hash hash) {
    this.hash = hash;
    collisions = new TreeSet<String, Set<String>>();
  }

  boolean addWord(String word) {
    String key = hash.hash(word);
    Set<String> similar = collisions.get(key);
    if (similar == null)
      collisions.put(key, similar = new TreeSet<String>());
    return similar.add(word);
  }

  Set<String> similar(String word) {
    Set<String> similar = collisions.get(hash.hash(word));
    if (similar == null)
      return Collections.emptySet();
    else
      return Collections.unmodifiableSet(similar);
  }

}

The hash strategy could be Soundex, Metaphone, or what have you. Some strategies might be tunable (how many characters does it output, etc.)

你在我安 2024-07-15 16:25:16

由于 soundex 是一个哈希,因此我将使用哈希表,以 soundex 作为键。

Since soundex is a hash, I'd use a hash table, with the soundex as the key.

绝對不後悔。 2024-07-15 16:25:16

你想要一个 4 字节整数。

soundex 算法始终返回 4 个字符的代码,如果您使用 ANSI 输入,您将得到 4 个字节(表示为 4 个字母)。

因此,将返回的代码存储在哈希表中,将单词转换为代码并在哈希表中查找。 真的就是这么简单。

you want a 4-byte integer.

The soundex algorithm always returns a 4-character code, if you use ANSI inputs, you'll get 4-bytes back (represented as 4 letters).

So store the codes returned in a hashtable, convert your word to the code and look it up in the hashtable. Its really that easy.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文