Identifying duplicate elements in a list containing 300k+ Strings

Posted on 2024-12-25 14:42:02

I have a list containing 305899 Strings (which is the username for a website). After I remove all the duplicates, the number goes down to 172123 Strings.

I want to find how many times a particular String (the username) is repeated in that ArrayList. I wrote a simple bubble sort type logic but it was too slow.

private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
    Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
    int duplicate = 0;
    int size = userNameList.size();
    for (int i = 0; i < size - 1; i++) {
        duplicate = 0;
        for (int j = i + 1; j < size; j++) {
            if (userNameList.get(i).equals(userNameList.get(j))) {
                duplicate++;
                userNameList.remove(j);
                j--;
                size--;

            }
        }
        numberOfPosts.put(userNameList.get(i), duplicate);
    }

    return numberOfPosts;
}

Then I changed it to this:

private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
    Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();

    Set<String> unique = new HashSet<String>(userNameList);

    for (String key : unique) {
        numberOfPosts.put(key, Collections.frequency(userNameList, key));
    }

    return numberOfPosts;
}

This was really slow as well. By slow, I mean it would take 30+ minutes to get through the list.

Is there a more efficient way to handle this problem? I just want to reduce the time it takes to find and count duplicate elements.

Comments (8)

烂柯人 2025-01-01 14:42:02

Your findNumberOfPosts method is on the right track, but your implementation is doing loads of unnecessary work.
Try this:

private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
    Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();

    for (String userName : userNameList) {
        Integer count = numberOfPosts.get(userName);
        numberOfPosts.put(userName, count == null ? 1 : ++count);
    }
    return numberOfPosts;
}

This should execute in a couple of seconds on most machines.

安静 2025-01-01 14:42:02

See if this variation of your second method works faster:

private static Map<String, Integer> findNumberOfPosts(
        List<String> userNameList) {
    Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();

    for (String name : userNameList) {
        Integer count = numberOfPosts.get(name);
        numberOfPosts.put(name, count == null ? 1 : (1 + count));
    }

    return numberOfPosts;
}

It has some boxing/unboxing overhead, but should operate a lot faster than what you were doing, which required iterating over the entire list of names for each unique name.
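The single-pass idea above can also be written with `Map.merge`, which exists on `java.util.Map` since Java 8. This is only a sketch of the same technique, not a replacement for the answer's code; the class name `MergeCount` and the sample names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeCount {
    // Counts how many times each name occurs, visiting every element once.
    public static Map<String, Integer> countOccurrences(List<String> userNameList) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String name : userNameList) {
            // Stores 1 for a name not seen yet, otherwise adds 1 to its count.
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        List<String> names = Arrays.asList("ann", "bob", "ann", "cid", "ann");
        System.out.println(countOccurrences(names));
    }
}
```

Each element costs one hash lookup, so the work grows linearly with the list size instead of with the list size times the number of unique names.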

如果没有 2025-01-01 14:42:02

You could try building a Trie out of the usernames. Finding the number of distinct elements (usernames) would then be trivial. The code for a Trie is a little complicated, so you are better off looking up resources on how to implement one.
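For readers who want a starting point, here is a minimal sketch of a counting trie. It is one possible shape, not the commenter's code: the class `UsernameTrie`, its methods, and the sample names are all hypothetical, and it stores children in a `HashMap` per node for simplicity rather than speed.

```java
import java.util.HashMap;
import java.util.Map;

public class UsernameTrie {
    // One node per character; count is how many inserted names end at this node.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        int count;
    }

    private final Node root = new Node();
    private int distinct;

    // Walks (and creates) the path for name, then bumps the count at its last node.
    public void add(String name) {
        Node node = root;
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        if (node.count == 0) {
            distinct++; // first time this exact name is seen
        }
        node.count++;
    }

    // Returns how many times name was added, or 0 if it never was.
    public int countOf(String name) {
        Node node = root;
        for (int i = 0; i < name.length(); i++) {
            node = node.children.get(name.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    public int distinctCount() {
        return distinct;
    }

    public static void main(String[] args) {
        UsernameTrie trie = new UsernameTrie();
        for (String name : new String[] {"ann", "anna", "ann", "bob"}) {
            trie.add(name);
        }
        System.out.println(trie.countOf("ann"));   // prints 2
        System.out.println(trie.distinctCount());  // prints 3
    }
}
```

Insertion and lookup cost is proportional to the length of the username, independent of how many names are stored.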

On second thought, considering the practical scenario, you should not have this list of duplicates at all. If the system providing the usernames were properly designed, duplicates would not exist in the first place.

我喜欢麦丽素 2025-01-01 14:42:02

This goes even faster than Bohemian's:

private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
    Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();

    for (String userName : userNameList) {
        if (!numberOfPosts.containsKey(userName)) {
            numberOfPosts.put(userName, Collections.frequency(userNameList, userName));
        }
    }

    return numberOfPosts;
}

伴我老 2025-01-01 14:42:02

The best solution is to add all the elements to an Array and then sort that array.

Then you can just iterate over the array and the duplicates will be placed next to each other in the array.
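A rough sketch of this sort-then-scan approach follows. The class name `SortedRunCount` and the sample names are invented for the example, and it sorts a copy of the list instead of an array, which is an assumption about what the commenter had in mind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SortedRunCount {
    // Sorts a copy of the list, then measures each run of equal neighbours.
    public static Map<String, Integer> countBySorting(List<String> userNameList) {
        List<String> sorted = new ArrayList<String>(userNameList);
        Collections.sort(sorted);

        // LinkedHashMap keeps the names in sorted (insertion) order.
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        int i = 0;
        while (i < sorted.size()) {
            String name = sorted.get(i);
            int j = i;
            // Advance j to the first element that differs from name.
            while (j < sorted.size() && sorted.get(j).equals(name)) {
                j++;
            }
            counts.put(name, j - i); // run length = number of occurrences
            i = j;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        List<String> names = Arrays.asList("bob", "ann", "bob", "cid", "ann", "bob");
        System.out.println(countBySorting(names)); // prints {ann=2, bob=3, cid=1}
    }
}
```

The sort dominates the cost at O(n log n); the scan afterwards touches each element once.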

罪歌 2025-01-01 14:42:02

You should try improving the first implementation: for each entry you're iterating through the entire list. How about something like:

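// Number of repeats beyond the first occurrence, keyed by username.
Map<String, Integer> map = new HashMap<String, Integer>();
for (String username : usernames) {
    if (!map.containsKey(username)) {
        // First sighting: no duplicates yet.
        map.put(username, 0);
    } else {
        // Seen before: one more duplicate (autoboxing handles the Integer).
        map.put(username, map.get(username) + 1);
    }
}
return map;

因为看清所以看轻 2025-01-01 14:42:02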

Use the data structure that was designed to support this natively. Store the user names in a Multiset and let it automatically maintain the frequency/count for you.

Read this tutorial to understand how a multiset works.
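The tutorial link did not survive, and the comment does not say which Multiset it means; library types such as Guava's `Multiset` are a likely candidate, but that is an assumption. To show the idea without any dependency, here is a small hypothetical `SimpleMultiset` class built on the standard `HashMap`. It is a sketch of the concept, not the API of any real library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class SimpleMultiset<E> {
    // Occurrence count per distinct element.
    private final Map<E, Integer> counts = new HashMap<E, Integer>();
    // Total number of elements added, duplicates included.
    private int size;

    public void add(E element) {
        counts.merge(element, 1, Integer::sum);
        size++;
    }

    // Returns how many times element was added, or 0 if never.
    public int count(E element) {
        Integer c = counts.get(element);
        return c == null ? 0 : c;
    }

    // The distinct elements seen so far.
    public Set<E> elementSet() {
        return counts.keySet();
    }

    public int size() {
        return size;
    }

    public static void main(String[] args) {
        SimpleMultiset<String> names = new SimpleMultiset<String>();
        for (String name : new String[] {"ann", "bob", "ann"}) {
            names.add(name);
        }
        System.out.println(names.count("ann"));        // prints 2
        System.out.println(names.elementSet().size()); // prints 2
        System.out.println(names.size());              // prints 3
    }
}
```

The counting then happens as the usernames are stored, so no separate pass over the list is needed afterwards.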

-柠檬树下少年和吉他 2025-01-01 14:42:02

The following is a convenient way to remove duplicates and count the number of duplicate elements in a List. No extra logic is needed.

List<String> userNameList = new ArrayList<String>();
// add elements to userNameList, including duplicates

userNameList.add("a");
userNameList.add("a");
userNameList.add("a");
userNameList.add("a");

userNameList.add("b");
userNameList.add("b");
userNameList.add("b");
userNameList.add("b");

userNameList.add("c");
userNameList.add("c");
userNameList.add("c");
userNameList.add("c");

int originalSize=userNameList.size();

Set<String> hs = new HashSet<String>();   // A Set handles the duplicates automatically.
hs.addAll(userNameList);
userNameList.clear();
userNameList.addAll(hs);

Collections.sort(userNameList);  //Sort the List, if needed.

//Displays elements after removing duplicate entries.
for (String element : userNameList)
{
    System.out.println(element);
}

int duplicate=originalSize-userNameList.size();

System.out.println("Duplicate entries in the List:->"+duplicate); //Number of duplicate entries.

 /*Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();   //Store duplicate entries in your Map using some key.
 numberOfPosts.put(userNameList.get(i), duplicate);

 return(numberOfPosts);*/