在Java中检测字符串中重复字符的最有效方法是什么?

发布于 2024-10-27 07:55:13 字数 1333 浏览 1 评论 0原文

使用数据结构(HashMap)我能够做到这一点。

这是代码:

import java.util.*;

class unique{
    public static void main(String[] args){
        HashMap<Character, Integer> charMap = new HashMap<Character, Integer>();
        boolean isNotUnique = false;
            for ( char loop : args[0].toCharArray() ){
            Integer freq = charMap.get( loop );
            charMap.put( loop, ( freq == null )? 1 : freq+1 );
            if ( charMap.get( loop ) > 1 )
            {
                isNotUnique = true;
            }
        }
            System.out.println ( isNotUnique );
    }
}

没有数据结构,我想出了一个生硬的方法。 这有 O(n^2)

class unique
{
    public static void main(String[] args)
    {
        String inputString = args[0];
        System.out.println( isUnique( inputString ) );

    }

    private static boolean isUnique(String inputString) {
        String methodString = inputString;
        for ( int i = 0; i < inputString.length(); i++ )
        {
            for ( int j = i+1; j < inputString.length(); j++ )
            {
                if ( methodString.charAt( i ) == methodString.charAt( j ) )
                {
                    return false;
                }
            }
        }
        return true;
    }
}

我想知道是否可以在 O(n) 时间复杂度内解决

Using data structures (HashMap) I was able to do it.

This is the code:

import java.util.*;

class unique{
    public static void main(String[] args){
        HashMap<Character, Integer> charMap = new HashMap<Character, Integer>();
        boolean isNotUnique = false;
            for ( char loop : args[0].toCharArray() ){
            Integer freq = charMap.get( loop );
            charMap.put( loop, ( freq == null )? 1 : freq+1 );
            if ( charMap.get( loop ) > 1 )
            {
                isNotUnique = true;
            }
        }
            System.out.println ( isNotUnique );
    }
}

Without datastructures, I came up with a blunt approach.
This has O(n^2)

class unique
{
    public static void main(String[] args)
    {
        String inputString = args[0];
        System.out.println( isUnique( inputString ) );

    }

    private static boolean isUnique(String inputString) {
        String methodString = inputString;
        for ( int i = 0; i < inputString.length(); i++ )
        {
            for ( int j = i+1; j < inputString.length(); j++ )
            {
                if ( methodString.charAt( i ) == methodString.charAt( j ) )
                {
                    return false;
                }
            }
        }
        return true;
    }
}

I was wondering if it was possible to solve in O(n) time complexity

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

金兰素衣 2024-11-03 07:55:13

如果您需要支持不由代理字符对表示的 Unicode 字符,可以这样做:

private static boolean isUnique(String inputString) {
    long[] used = new long[1024];
    for (char c : inputString.toCharArray()) {
        if ((used[c >>> 6] & (1 << c)) > 0) {
            return false;
        }
        used[c >>> 6] |= 1 << c;
    }
    return true;
}

它使用位翻转来节省内存。从本质上讲,这与使用布尔数组是一样的:

private static boolean isUnique2(String inputString) {
    boolean[] used = new boolean[65536];
    for (char c : inputString.toCharArray()) {
        if (used[c]) {
            return false;
        }
        used[c] = true;
    }
    return true;
}

如果您只需要支持 ASCII 字符,则可以在任何一种情况下限制 used 的大小,以减少所需的内存(因此 long [4]boolean[256])。在 inputString 的特定长度以下,执行 n^2 检查可能比为此分配内存更快。因此,理想情况下,您可以根据长度将两者结合起来。

如果您需要支持所有可能的 Unicode 字符,则必须修改它以支持代理字符对。您可以使用 Character.isHighSurrogate(c) 检测它们。请参阅此页面了解一些帮助并搜索 Google 了解更多详细信息。

If you need to support Unicode characters that aren't represented by surrogate char pairs, this will do it:

private static boolean isUnique(String inputString) {
    long[] used = new long[1024];
    for (char c : inputString.toCharArray()) {
        if ((used[c >>> 6] & (1 << c)) > 0) {
            return false;
        }
        used[c >>> 6] |= 1 << c;
    }
    return true;
}

It's using bit flips to save memory. It's essentially the same thing as if you used an array of booleans:

private static boolean isUnique2(String inputString) {
    boolean[] used = new boolean[65536];
    for (char c : inputString.toCharArray()) {
        if (used[c]) {
            return false;
        }
        used[c] = true;
    }
    return true;
}

If you only need to support ASCII characters you could limit the size of used in either case to reduce the memory required (so long[4] and boolean[256]). Below a certain length of inputString it's probably faster to do the n^2 check than allocate the memory for this though. So ideally you do a combination of the two based on the length.

If you need to support all possible Unicode characters you'll have to modify this to support surrogate char pairs. You can detect them with Character.isHighSurrogate(c). See this page for some help and search Google for more details.

素手挽清风 2024-11-03 07:55:13

性格的定义是什么? az AZ 还是全部 unicode?

如果字符串的长度很大,比如一百万,你可以构建一个int数组,数组的长度是你的字符集的长度,并且数组将被初始化为零。

之后,根据每个字符遍历字符串:array[(int)char]++,这样,您就可以轻松地从数组中找到字符出现的时间。

然而,短字符串不需要这样的方法。

该方法的复杂度为 O(n)

What is the definition of character? a-z A-Z or all unicode?

if the length of string is quite big, such as one million, you could build an int array, the length of array is the length of you character set and the array would be initialized with zero.

after this, go through the string, according to each charactor : array[(int)char]++, so that, u can easily find the time of character appearing from the array.

however, short string won't need such method.

this method is O(n)

爱的那么颓废 2024-11-03 07:55:13

我想知道是否可以在 O(n) 时间复杂度内解决:

有两个简单的解决方案,时间复杂度为 O(N):

  • HashSet方法在时间上为 O(N),在空间上为 O(N),其中 N 是字符串长度。 (包含 N 个不同字符的常规 Java HashSet 将占用 O(N) 空间,并且具有相对较大的比例常数。)

  • 位数组方法在时间上为 O(N),在空间上为 O(1),但 O(1) 为 8K 字节(或 64K 字节(如果使用 boolean[]),并且将添加到时间中的这么多内存归零会产生相关成本。

这些都不是所有情况下的最佳答案。

  • 对于足够小的N,简单的O(N^2)嵌套循环将是最快的。 (并且它不使用额外的内存。)

  • 对于中等大小的 N,在冲突时使用重新哈希的自定义哈希表将比 HashSet 或位数组更好方法。 HashSet 方法比位数组方法更好。

  • 对于足够大的N,位数组方法将是最快的并且使用最少的内存。

(对于上述内容,我假设输入字符串不包含任何重复字符。如果包含任何重复字符,则 N 的实际值将小于字符串长度。)


如果重复字符检测需要处理 UTF-16 代理对,那么简单的方法是即时转码为 Unicode 代码点,并更改数据结构以使用 HashSet、更大的位数组等。

相反,如果可以限制输入字符集的大小,则可以减小位数组的大小等。

这些调整将对小/中/大阈值可能下降的位置产生很大影响。

I was wondering if it was possible to solve in O(n) time complexity:

There are two simple solutions that are O(N) in time:

  • The HashSet approach is O(N) in time and O(N) in space, where N is the String length. (A regular Java HashSet containing N distinct characters will take O(N) space, with a relatively large constant of proportionality.)

  • The bit array approach is O(N) in time and O(1) in space, but the O(1) is 8K bytes (or 64K bytes if you use a boolean[]) and there is an associated cost of zeroing that much memory added to the time.

Neither of these is the best answer in all cases.

  • For small enough N, a simple O(N^2) nested loop will be be fastest. (And it uses no extra memory.)

  • For medium-sized N, a custom hash table that uses rehashing on collision will be better than HashSet or the bit array approach. The HashSet approach will be better than the bit array approach.

  • For large enough N, the bit array approach will be fastest and use the least memory.

(For the above, I'm assuming that input strings don't contain any duplicate characters. If they do, then the actual value of N will be less than the string length.)


If duplicate character detection needs to cope with UTF-16 surrogate pairs, then the simple approach is to transcode on the fly to Unicode codepoints, and change the data structures to use a HashSet<Integer>, bigger bit arrays, etc.

Conversely, if you can limit the size of the input character set, you can reduce the size of the bit arrays, etc.

These tweaks would make a big difference to where the small / medium / large thresholds are likely to fall.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文