当前位置：文江博客话题详情

ruby 中的大数组操作非常慢

发布于 2024-12-10 18:17:51 字数 1475 浏览 3 评论 0原文

我有以下场景：

我需要找出一个非常大的集合中唯一的 id 列表。

例如，我有 6000 个 ids 数组（关注者列表），每个数组的大小范围在 1 到 25000 之间（他们的关注者列表）。

我想获取所有这些 id 数组中的唯一 id 列表（关注者的唯一关注者）。完成后，我需要减去另一个 id 列表（另一个人的关注者列表）并获得最终计数。

最终的唯一 ID 集增长到大约 60,000,000 条记录。在 ruby 中，当将数组添加到大数组时，它开始变得非常慢，大约有几百万。添加到集合一开始需要 0.1 秒，然后增长到 200 万时需要 4 秒以上（离我需要去的地方还很远）。

我用java编写了一个测试程序，它在不到一分钟的时间内完成了整个过程。

也许我在红宝石中这样做效率低下，或者还有另一种方法。由于我的主要代码是专有的，所以我编写了一个简单的测试程序来模拟该问题：

big_array = []
loop_counter = 0
start_time = Time.now
# final target size of the big array
while big_array.length < 60000000
 loop_counter+=1
 # target size of one persons follower list
 random_size_of_followers = rand(5000)
 follower_list = []
 follower_counter = 0
   while follower_counter < random_size_of_followers
     follower_counter+=1
     # make ids very large so we get good spread and only some amt of dupes
     follower_id = rand(240000000) + 100000
     follower_list << follower_id
   end
 # combine the big list with this list
 big_array = big_array | follower_list
 end_time = Time.now

 # every 100 iterations check where we are and how long each loop and combine takes.
 if loop_counter % 100 == 0
   elapsed_time = end_time - start_time
   average_time = elapsed_time.to_f/loop_counter.to_f
   puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
   start_time = Time.now
 end
end

有什么建议吗？是时候切换到 jruby 并将此类内容移动到 java 了吗？

原文

I have the following scenario:

I need to figure out the unique list of ids across a very large set.

So for example, I have 6000 arrays of ids (followers list), each can range in size between 1 and 25000 (their follower list).

I want to get the unique list of ids across all of these arrays of ids (unique followers of followers). Once that is done I need to subtract out another list (another persons follower list) of ids and get a final count.

The final set of unique ids grows to around 60,000,000 records. In ruby when adding the arrays to the big array, it starts to get very slow around a couple of million. Adding to the set takes .1 seconds at first, then grows to taking over 4 seconds at 2 million (no where near where I need to go).

I wrote a test program in java and it does the whole thing in less than a minute.

Maybe I am doing this inefficiently in ruby, or there is another way. Since my main code is proprietary I have wrote a simple test program to simulate the issue:

big_array = []
loop_counter = 0
start_time = Time.now
# final target size of the big array
while big_array.length < 60000000
 loop_counter+=1
 # target size of one persons follower list
 random_size_of_followers = rand(5000)
 follower_list = []
 follower_counter = 0
   while follower_counter < random_size_of_followers
     follower_counter+=1
     # make ids very large so we get good spread and only some amt of dupes
     follower_id = rand(240000000) + 100000
     follower_list << follower_id
   end
 # combine the big list with this list
 big_array = big_array | follower_list
 end_time = Time.now

 # every 100 iterations check where we are and how long each loop and combine takes.
 if loop_counter % 100 == 0
   elapsed_time = end_time - start_time
   average_time = elapsed_time.to_f/loop_counter.to_f
   puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
   start_time = Time.now
 end
end

Any suggestions, is it time to switch to jruby and move stuff like this to java?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

颜漓半夏 2024-12-17 18:17:51

您在那里使用的方法效率极低，因此速度缓慢也就不足为奇了。当您尝试跟踪独特的事物时，数组比哈希等价物需要更多的处理。

这是一个简单的重构，可以将速度提高大约 100 倍：

all_followers = { }
loop_counter = 0
start_time = Time.now

while (all_followers.length < 60000000)
  # target size of one persons follower list
  follower_list = []

  rand(5000).times do
    follower_id = rand(240000000) + 100000
    follower_list << follower_id
    all_followers[follower_id] = true
  end

 end_time = Time.now

 # every 100 iterations check where we are and how long each loop and combine takes.
 loop_counter += 1

  if (loop_counter % 100 == 0)
    elapsed_time = end_time - start_time
    average_time = elapsed_time.to_f/loop_counter.to_f
    puts "average time for loop is #{average_time}, total size of all_followers is #{all_followers.length}"
    start_time = Time.now
  end
end

哈希的好处是不可能有重复项。如果您需要随时列出所有关注者，请使用 all_followers.keys 获取 ID。

哈希比数组占用更多的内存，但这是为了性能而必须付出的代价。我还怀疑这里最大的内存消耗者之一是生成的许多单独的关注者列表，但似乎从未使用过，所以也许您可以完全跳过该步骤。

这里的关键是数组 | 运算符效率不是很高，特别是在操作非常大的数组时。

The method you're using there is horribly inefficient, so it's no surprise this is slow. When you're trying to keep track of unique things, an Array requires way more processing than a Hash equivalent.

Here's a simple refactoring that increases the speed about 100x:

all_followers = { }
loop_counter = 0
start_time = Time.now

while (all_followers.length < 60000000)
  # target size of one persons follower list
  follower_list = []

  rand(5000).times do
    follower_id = rand(240000000) + 100000
    follower_list << follower_id
    all_followers[follower_id] = true
  end

 end_time = Time.now

 # every 100 iterations check where we are and how long each loop and combine takes.
 loop_counter += 1

  if (loop_counter % 100 == 0)
    elapsed_time = end_time - start_time
    average_time = elapsed_time.to_f/loop_counter.to_f
    puts "average time for loop is #{average_time}, total size of all_followers is #{all_followers.length}"
    start_time = Time.now
  end
end

The nice thing about a Hash is that it's impossible to have duplicates. If you need to list all the followers at any time, use all_followers.keys to get the IDs.

Hashes take up more memory than their Array equivalents, but this is the price you have to pay for performance. I'd also suspect one of the big memory consumers here is the many individual lists of followers that are generated and seemingly never used, so perhaps you could skip that step entirely.

The key thing here is that the Array | operator is not very efficient, especially when operating on very large arrays.

回复收藏 0 原文

寄与心 2024-12-17 18:17:51

下面是一个使用数组、散列和集合处理唯一对象的示例：

require 'benchmark'
require 'set'
require 'random_token'

n = 10000

Benchmark.bm(7) do |x|
  x.report("array:") do
    created_tokens = []
    while created_tokens.size < n
      token = RandomToken.gen(10)
      if created_tokens.include?(token)
        next
      else
        created_tokens << token
      end
    end
    results = created_tokens
  end

  x.report("hash:") do
    created_tokens_hash = {}
    while created_tokens_hash.size < n
      token = RandomToken.gen(10)
      created_tokens_hash[token] = true
    end
    results = created_tokens_hash.keys
  end

  x.report("set:") do
    created_tokens_set = Set.new
    while created_tokens_set.size < n
      token = RandomToken.gen(10)
      created_tokens_set << token
    end
    results = created_tokens_set.to_a
  end
end

及其基准：

              user     system      total        real
array:    8.860000   0.050000   8.910000 (  9.112402)
hash:     2.030000   0.010000   2.040000 (  2.062945)
set:      2.000000   0.000000   2.000000 (  2.037125)

参考：

ruby 处理unique 对象

Here is an example to handle unique objects with array, hash and set:

require 'benchmark'
require 'set'
require 'random_token'

n = 10000

Benchmark.bm(7) do |x|
  x.report("array:") do
    created_tokens = []
    while created_tokens.size < n
      token = RandomToken.gen(10)
      if created_tokens.include?(token)
        next
      else
        created_tokens << token
      end
    end
    results = created_tokens
  end

  x.report("hash:") do
    created_tokens_hash = {}
    while created_tokens_hash.size < n
      token = RandomToken.gen(10)
      created_tokens_hash[token] = true
    end
    results = created_tokens_hash.keys
  end

  x.report("set:") do
    created_tokens_set = Set.new
    while created_tokens_set.size < n
      token = RandomToken.gen(10)
      created_tokens_set << token
    end
    results = created_tokens_set.to_a
  end
end

and their benchmark:

              user     system      total        real
array:    8.860000   0.050000   8.910000 (  9.112402)
hash:     2.030000   0.010000   2.040000 (  2.062945)
set:      2.000000   0.000000   2.000000 (  2.037125)

Refs:

ruby處理unique物件

回复收藏 0 原文

~没有更多了~