查找字符串值的最快方法

发布于 2024-12-05 10:19:49 字数 761 浏览 5 评论 0原文

我有一个简单的应用程序，可以从大型文本文件中读取小字符串中的数据并将其保存到数据库中。为了实际保存每个这样的字符串，应用程序多次（可能数千次或更多）调用以下方法：

setValue(String value)
{
    if (!ignore(value))
    {
         // Save the value in the database
    }
}

目前，我通过连续比较一组字符串来实现ignore（）方法，例如

public boolean ignore(String value)
{
    if (value.equalsIgnoreCase("Value 1") || (value.equalsIgnoreCase("Value 2"))
    {
        return true;
    }

    return false;
}

但是，因为我需要检查许多此类“可忽略”值，这些值将在代码的另一部分中定义，所以我需要使用数据结构进行此检查，而不是多个连续的 if 语句。

所以，我的问题是，从标准 Java 到实现这一点最快的数据结构是什么？哈希映射？一套？还有别的事吗？

初始化时间不是问题，因为它将静态发生并且每次应用程序调用一次。

编辑：到目前为止建议的解决方案（包括 HashSet）看起来比仅使用 String[] 包含所有被忽略的单词并仅针对每个单词运行“equalsIgnoreCase”要慢。

原文

I have a simple application that reads data in small strings from large text files and saves them to a database. To actually save each such String, the application calls the following method several (may thousands, or more) times:

setValue(String value)
{
    if (!ignore(value))
    {
         // Save the value in the database
    }
}

Currently, I implement the ignore() method by just successively comparing a set of Strings, e.g.

public boolean ignore(String value)
{
    if (value.equalsIgnoreCase("Value 1") || (value.equalsIgnoreCase("Value 2"))
    {
        return true;
    }

    return false;
}

However, because I need to check against many such "ignorable" values, which will be defined in another part of the code, I need to use a data structure for this check, instead of multiple consecutive if statements.

So, my question is, what would be the fastest data structure from standard Java to to implement this? A HashMap? A Set? Something else?

Initialization time is not an issue, since it will happen statically and once per application invocation.

EDIT: The solutions suggested thus far (including HashSet) appear slower than just using a String[] with all the ignored words and just running "equalsIgnoreCase" against each of these.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

坦然微笑 2024-12-12 10:19:49

使用 HashSet，以小写形式存储值，及其 contains() 方法，它比 TreeSet 具有更好的查找性能（contains 的恒定时间与对数时间）。

Set<String> ignored = new HashSet<String>();
ignored.add("value 1"); // store in lowercase
ignored.add("value 2"); // store in lowercase

public boolean ignore(String value) {
    return ignored.contains(value.toLowerCase());    
}

以小写形式存储值并搜索小写输入可以避免在比较期间处理大小写的麻烦，因此您可以获得 HashSet 实现的全速和零编写的与集合相关的代码（例如 Collator、比较器等）。

已编辑
感谢 Jon Skeet 指出，某些土耳其语字符在调用 toLowerCase() 时表现得很奇怪，但如果您不打算支持土耳其语输入（或者可能是其他具有非标准大小写问题的语言），那么这种方法很适合你。

Use a HashSet, storing the values in lowercase, and its contains() method, which has better lookup performance than TreeSet (constant-time versus log-time for contains).

Set<String> ignored = new HashSet<String>();
ignored.add("value 1"); // store in lowercase
ignored.add("value 2"); // store in lowercase

public boolean ignore(String value) {
    return ignored.contains(value.toLowerCase());    
}

Storing the values in lowercase and searching for the lowercased input avoids the hassle of dealing with case during comparison, so you get the full speed of the HashSet implementation and zero collection-related code to write (eg Collator, Comparator etc).

EDITED
Thanks to Jon Skeet for pointing out that certain Turkish characters behave oddly when calling toLowerCase(), but if you're not intending on supporting Turkish input (or perhaps other languages with non-standard case issues) then this approach will work well for you.

回复收藏 0 原文

帅气称霸 2024-12-12 10:19:49

在大多数情况下，我通常会从 HashSet 开始 - 但由于您想要不区分大小写，这使得它稍微困难一些。

您可以尝试使用 TreeSet

Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.SECONDARY);

TreeSet<Object> set = new TreeSet<Object>(collator);

请注意，您无法创建 TreeSet，因为 Collator 仅实现 Comparator

编辑：虽然上述版本仅适用于字符串，但创建 TreeSet可能会更快：

Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.SECONDARY);

TreeSet<CollationKey> set = new TreeSet<CollationKey>();
for (String value : valuesToIgnore) {
    set.add(collator.getCollationKey(value));
}

那么：

public boolean ignore(String value)
{
    return set.contains(collator.getCollationKey(value));
}

那就nice有一种方法可以存储所有被忽略的值的排序规则键，但在测试时避免创建新的排序规则键，但我不知道有什么方法可以做到这一点。

In most cases I'd normally start with a HashSet<String> - but as you want case-insensitivity, that makes it slightly harder.

You can try using a TreeSet<Object> using an appropriate Collator for case-insensitivity. For example:

Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.SECONDARY);

TreeSet<Object> set = new TreeSet<Object>(collator);

Note that you can't create a TreeSet<String> as Collator only implements Comparator<Object>.

EDIT: While the above version works with just strings, it may be faster to create a TreeSet<CollationKey>:

Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.SECONDARY);

TreeSet<CollationKey> set = new TreeSet<CollationKey>();
for (String value : valuesToIgnore) {
    set.add(collator.getCollationKey(value));
}

Then:

public boolean ignore(String value)
{
    return set.contains(collator.getCollationKey(value));
}

It would be nice to have a way of storing the collation keys for all ignored values but then avoid creating new collation keys when testing, but I don't know of a way of doing that.

回复收藏 0 原文

紫罗兰の梦幻 2024-12-12 10:19:49

将要忽略的单词添加到列表中，然后检查该单词是否在该列表中。

这使它成为动态的。

回复收藏 0 原文

再见回来 2024-12-12 10:19:49

如果使用 Java 7，这是一种快速的方法：

public boolean ignore(String value) {
  switch(value.toLowerCase()) { // see comment Jon Skeet
    case "lowercased_ignore_value1":
    case "lowercased_ignore_value2":
      // etc
      return true;
    default:
      return false;
  }
}

If using Java 7 this is a fast way to do it:

public boolean ignore(String value) {
  switch(value.toLowerCase()) { // see comment Jon Skeet
    case "lowercased_ignore_value1":
    case "lowercased_ignore_value2":
      // etc
      return true;
    default:
      return false;
  }
}

回复收藏 0 原文

臻嫒无言 2024-12-12 10:19:49

看来 String[] 比建议的其他方法稍微好一些（性能方面），所以我将使用它。

它就像这样：

public boolean ignore(String value)
{
    for (String ignore:IGNORED_VALUES)
    {
        if (ignore.equalsIgnoreCase(value))
        {
            return true;
        }

        return false;
    }

IGNORED_VALUES 对象只是一个 String[] ，其中包含所有被忽略的值。

It seems that String[] is slightly better (performance-wise) than the other methods proposed, so I will use that.

It is simply something like this:

public boolean ignore(String value)
{
    for (String ignore:IGNORED_VALUES)
    {
        if (ignore.equalsIgnoreCase(value))
        {
            return true;
        }

        return false;
    }

The IGNORED_VALUES object is just a String[] with all ignored values in there.

回复收藏 0 原文

~没有更多了~