Counting distinct elements in a large list is too slow

Posted on 2024-12-19 18:37:21

I have a list like this (let's say it is stored in summ.txt):

s1   d2
s1   d4
s3   d2
s4   d1
s1   d3 
s4   d1
s5   d6
s3   d5
s1   d2

For every element in the first column (s_), I need the number of distinct elements in the second column (d_). In this case, s1 has 3, s3 has 2, s4 has 1 and s5 has 1.

I'm using a shell script to obtain this:

sor=`cat s.txt`

for d in $sor
do

n=$( grep $d ./summ.txt | cut -f2 | sort -u | wc -l)
echo $d, $n

done

Here s.txt is the file that contains all the distinct s_ values. In this case it will be:

s1
s2
s3
s4
s5

I know that this approach works because I've tried it. The main problem is that the main list (summ.txt) has about 19 million elements and there are about 3 million distinct s_ values, so computing everything takes far too long. Can you suggest a faster algorithm?

夏末 2024-12-26 18:37:21

The sorting step is O(n lg n) and can be avoided in favor of a linear-time algorithm. Here's a Python version:

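import sys
from collections import defaultdict

distinct_values = defaultdict(set)  # hashmap of keys to hashsets of values
for line in sys.stdin:
    key, val = line.split()  # assumes exactly two whitespace-separated fields per line
    distinct_values[key].add(val)

# Python 3: dict.items() and print() replace iteritems() and the print statement
for key, values in distinct_values.items():
    print(key, len(values))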

(Sorted output can be obtained in O(k lg k) extra time, where k is the number of distinct keys.)
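
As a rough illustration of that last remark, the sketch below wraps the same hash-based counting in a function and sorts only the resulting keys. The function names count_distinct and sorted_counts are made up for this example, and skipping lines that do not have exactly two fields is an assumption about the input, not something the answer specifies.

```python
from collections import defaultdict


def count_distinct(lines):
    """Map each key to the number of distinct values seen with it."""
    distinct_values = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: silently skip blank or malformed lines
        key, val = fields
        distinct_values[key].add(val)
    return {key: len(values) for key, values in distinct_values.items()}


def sorted_counts(lines):
    """Return (key, count) pairs ordered by key; only the k keys are sorted."""
    return sorted(count_distinct(lines).items())


# Sample data from the question, including the trailing space on one line.
sample = ["s1   d2", "s1   d4", "s3   d2", "s4   d1", "s1   d3 ",
          "s4   d1", "s5   d6", "s3   d5", "s1   d2"]
for key, count in sorted_counts(sample):
    print(key, count)
```

The counting pass is still linear in the number of input lines; the sort works on the dictionary of keys, never on the 19 million rows.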

不知所踪 2024-12-26 18:37:21

Rather than going through the file once for each s_, do them all at once:

sort -u | cut -f 1 | uniq -c | awk '{ print $2","$1 }'

Applied to your sample data, this gives:

s1,3
s3,2
s4,1
s5,1

The processing done in this answer is about the same as that done for each s_ in the shell script in the question. Thus, I'd expect a speedup by a factor of about 3 million.
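
For readers who want to see what each pipeline stage contributes, here is a small Python sketch of the same idea: deduplicate and sort the (s_, d_) pairs once, then count adjacent pairs per key. The function name counts_from_sorted_pairs is hypothetical, and unlike cut -f 1 this version splits on any whitespace rather than requiring tab-separated input.

```python
from itertools import groupby


def counts_from_sorted_pairs(lines):
    """Count distinct values per key via one sort and one linear scan."""
    # Analogue of `sort -u`: deduplicate (key, value) pairs, then sort them.
    pairs = sorted({tuple(line.split()) for line in lines if line.strip()})
    # Analogue of `cut -f 1 | uniq -c`: count adjacent pairs sharing a key.
    return [(key, sum(1 for _ in group))
            for key, group in groupby(pairs, key=lambda pair: pair[0])]


# Sample data from the question.
sample = ["s1   d2", "s1   d4", "s3   d2", "s4   d1", "s1   d3 ",
          "s4   d1", "s5   d6", "s3   d5", "s1   d2"]
for key, count in counts_from_sorted_pairs(sample):
    print(f"{key},{count}")
```

The whole input is read and sorted exactly once, which is where the saving over one grep per s_ comes from.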


南汐寒笙箫 2024-12-26 18:37:21

Use a DBMS?

Or...

sort <input_file | awk -f counter.awk

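#!/usr/bin/awk -f
# counter.awk: expects sorted input, so equal keys and equal values are adjacent.

{
    if ($1 != prevfirstkey) {
        # New key: report the previous one and start counting again.
        dump();
        prevfirstkey = $1;
        prevnextkey = $2;
        count = 1;
    } else if ($2 != prevnextkey) {
        # Same key, new value.
        prevnextkey = $2;
        count++;
    }
}
function dump() {
    # count is 0 before the first line, so nothing is printed then.
    if (count > 0)
        print prevfirstkey " has " count " values";
    count = 0;
}
END {
    dump();
}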
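
The DBMS suggestion at the top of this answer could look roughly like the following, using Python's built-in sqlite3 module. This is only a sketch: the table name summ, the column names s and d, and the function name count_distinct_sql are invented for the example, and an in-memory database is used purely for illustration.

```python
import sqlite3


def count_distinct_sql(rows):
    """Count distinct d values per s with a GROUP BY query."""
    conn = sqlite3.connect(":memory:")  # in-memory database, illustration only
    try:
        # Hypothetical schema: one row per line of summ.txt.
        conn.execute("CREATE TABLE summ (s TEXT, d TEXT)")
        conn.executemany("INSERT INTO summ VALUES (?, ?)", rows)
        query = "SELECT s, COUNT(DISTINCT d) FROM summ GROUP BY s ORDER BY s"
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Sample data from the question, already split into (s_, d_) pairs.
rows = [("s1", "d2"), ("s1", "d4"), ("s3", "d2"), ("s4", "d1"), ("s1", "d3"),
        ("s4", "d1"), ("s5", "d6"), ("s3", "d5"), ("s1", "d2")]
for s, n in count_distinct_sql(rows):
    print(s, n)
```

Whether this beats the sort-based pipeline on 19 million rows would need to be measured; the point is only that COUNT(DISTINCT ...) with GROUP BY expresses the task directly.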
