ruby 中的大数组操作非常慢
我有以下场景:
我需要找出一个非常大的集合中唯一的 id 列表。
例如,我有 6000 个 ids 数组(关注者列表),每个数组的大小范围在 1 到 25000 之间(他们的关注者列表)。
我想获取所有这些 id 数组中的唯一 id 列表(关注者的唯一关注者)。完成后,我需要减去另一个 id 列表(另一个人的关注者列表)并获得最终计数。
最终的唯一 ID 集增长到大约 60,000,000 条记录。在 ruby 中,当将数组添加到大数组时,它开始变得非常慢,大约有几百万。添加到集合一开始需要 0.1 秒,然后增长到 200 万时需要 4 秒以上(离我需要去的地方还很远)。
我用java编写了一个测试程序,它在不到一分钟的时间内完成了整个过程。
也许我在红宝石中这样做效率低下,或者还有另一种方法。由于我的主要代码是专有的,所以我编写了一个简单的测试程序来模拟该问题:
big_array = []
loop_counter = 0
start_time = Time.now
# final target size of the big array
while big_array.length < 60000000
loop_counter+=1
# target size of one persons follower list
random_size_of_followers = rand(5000)
follower_list = []
follower_counter = 0
while follower_counter < random_size_of_followers
follower_counter+=1
# make ids very large so we get good spread and only some amt of dupes
follower_id = rand(240000000) + 100000
follower_list << follower_id
end
# combine the big list with this list
big_array = big_array | follower_list
end_time = Time.now
# every 100 iterations check where we are and how long each loop and combine takes.
if loop_counter % 100 == 0
elapsed_time = end_time - start_time
average_time = elapsed_time.to_f/loop_counter.to_f
puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
start_time = Time.now
end
end
有什么建议吗?是时候切换到 jruby 并将此类内容移动到 java 了吗?
I have the following scenario:
I need to figure out the unique list of ids across a very large set.
So for example, I have 6000 arrays of ids (followers list), each can range in size between 1 and 25000 (their follower list).
I want to get the unique list of ids across all of these arrays of ids (unique followers of followers). Once that is done I need to subtract out another list (another persons follower list) of ids and get a final count.
The final set of unique ids grows to around 60,000,000 records. In ruby when adding the arrays to the big array, it starts to get very slow around a couple of million. Adding to the set takes .1 seconds at first, then grows to taking over 4 seconds at 2 million (no where near where I need to go).
I wrote a test program in java and it does the whole thing in less than a minute.
Maybe I am doing this inefficiently in ruby, or there is another way. Since my main code is proprietary I have wrote a simple test program to simulate the issue:
big_array = []
loop_counter = 0
start_time = Time.now
# final target size of the big array
while big_array.length < 60000000
loop_counter+=1
# target size of one persons follower list
random_size_of_followers = rand(5000)
follower_list = []
follower_counter = 0
while follower_counter < random_size_of_followers
follower_counter+=1
# make ids very large so we get good spread and only some amt of dupes
follower_id = rand(240000000) + 100000
follower_list << follower_id
end
# combine the big list with this list
big_array = big_array | follower_list
end_time = Time.now
# every 100 iterations check where we are and how long each loop and combine takes.
if loop_counter % 100 == 0
elapsed_time = end_time - start_time
average_time = elapsed_time.to_f/loop_counter.to_f
puts "average time for loop is #{average_time}, total size of big_array is #{big_array.length}"
start_time = Time.now
end
end
Any suggestions, is it time to switch to jruby and move stuff like this to java?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您在那里使用的方法效率极低,因此速度缓慢也就不足为奇了。当您尝试跟踪独特的事物时,数组比哈希等价物需要更多的处理。
这是一个简单的重构,可以将速度提高大约 100 倍:
哈希的好处是不可能有重复项。如果您需要随时列出所有关注者,请使用
all_followers.keys
获取 ID。哈希比数组占用更多的内存,但这是为了性能而必须付出的代价。我还怀疑这里最大的内存消耗者之一是生成的许多单独的关注者列表,但似乎从未使用过,所以也许您可以完全跳过该步骤。
这里的关键是数组
|
运算符效率不是很高,特别是在操作非常大的数组时。The method you're using there is horribly inefficient, so it's no surprise this is slow. When you're trying to keep track of unique things, an Array requires way more processing than a Hash equivalent.
Here's a simple refactoring that increases the speed about 100x:
The nice thing about a Hash is that it's impossible to have duplicates. If you need to list all the followers at any time, use
all_followers.keys
to get the IDs.Hashes take up more memory than their Array equivalents, but this is the price you have to pay for performance. I'd also suspect one of the big memory consumers here is the many individual lists of followers that are generated and seemingly never used, so perhaps you could skip that step entirely.
The key thing here is that the Array
|
operator is not very efficient, especially when operating on very large arrays.下面是一个使用数组、散列和集合处理唯一对象的示例:
及其基准:
参考:
ruby 处理unique 对象
Here is an example to handle unique objects with array, hash and set:
and their benchmark:
Refs:
ruby處理unique物件