Read a file N lines at a time in Ruby

Posted on 2024-08-26 12:28:08

I have a large file (hundreds of megs) that consists of filenames, one per line.

I need to loop through the list of filenames, and fork off a process for each filename. I want a maximum of 8 forked processes at a time and I don't want to read the whole filename list into RAM at once.

I'm not even sure where to begin, can anyone help me out?

Comments (5)

离鸿 2024-09-02 12:28:08
File.foreach("large_file").each_slice(8) do |eight_lines|
  # eight_lines is an array of up to 8 lines (the last slice may be shorter);
  # each line still ends with its newline, so chomp it before use.
  # At this point you can iterate over these filenames
  # and spawn off your processes/threads.
end
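
As a rough sketch of how the body of that block might be filled in: the helper below forks one child per filename in each slice and waits for the whole slice before reading the next one. The name process_in_batches and its return value (the size of each batch) are made up for illustration, and the caller's block stands in for the real per-file work.

```ruby
# Hypothetical helper: read list_path lazily in slices of batch_size lines,
# fork one child per line, and wait for the whole slice before continuing.
def process_in_batches(list_path, batch_size = 8)
  batch_sizes = []
  File.foreach(list_path).each_slice(batch_size) do |lines|
    pids = lines.map do |line|
      fork { yield line.chomp }          # child runs the caller's block, then exits
    end
    pids.each { |pid| Process.wait(pid) } # reap every child in this batch
    batch_sizes << pids.size
  end
  batch_sizes
end
```

One thing to keep in mind with this shape is that every batch waits for its slowest child, so fewer than 8 processes are running towards the end of each batch.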
仅一夜美梦 2024-09-02 12:28:08

It sounds like the Process module will be useful for this task. Here's something I quickly threw together as a starting point:

include Process

i = 0
for line in open('files.txt') do
    i += 1
    fork { `sleep #{rand} && echo "#{i} - #{line.chomp}" >> numbers.txt` }

    if i >= 8
        wait # join any single child process
        i -= 1
    end
end

waitall # join all remaining child processes

Input (the contents of files.txt), followed by the output of a run:

hello
goodbye

test1
test2
a
b
c
d
e
f
g
$ ruby b.rb
$ cat numbers.txt 
1 - hello
3 - 
2 - goodbye
5 - test2
6 - a
4 - test1
7 - b
8 - c
8 - d
8 - e
8 - f
8 - g

The way this works is:

  • for line in open(XXX) lazily iterates over the lines of the file you specify, reading one line at a time rather than loading the whole file.
  • fork spawns a child process that executes the given block; in this case, we use backticks to hand a command to the shell. Note that rand returns a float in the range [0, 1) here, so each child sleeps for less than a second, and I call line.chomp to remove the trailing newline that comes with line.
  • If we've accumulated 8 or more processes, call wait to stop everything until one of them returns. Because i is decremented after each wait, it counts live children rather than lines, which is why the later output lines all show 8.
  • Finally, outside the loop, call waitall to join all remaining processes before exiting the script.
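
A variation on the same idea, as a minimal sketch rather than a drop-in replacement: it keeps the live-child count separate from the line number (so the numbering does not stall at 8) and runs a Ruby block in each child instead of going through the shell. The helper name each_line_forked and the peak return value are assumptions made up for this example.

```ruby
# Hypothetical helper: run the caller's block in a forked child for every
# line of list_path, never keeping more than max_children alive at once.
def each_line_forked(list_path, max_children = 8)
  live = 0   # children currently running
  peak = 0   # highest value live reached, returned for inspection
  File.foreach(list_path).with_index(1) do |line, lineno|
    if live >= max_children
      Process.wait          # block until any one child exits
      live -= 1
    end
    fork { yield line.chomp, lineno }   # child runs the block, then exits
    live += 1
    peak = live if live > peak
  end
  Process.waitall           # reap the remaining children
  peak
end
```

It could be called as each_line_forked("files.txt") { |name, lineno| ... }, with the real per-file work inside the block.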
东京女 2024-09-02 12:28:08

Here's Mark's solution wrapped up as a ProcessPool class, which might be handy to have around (please correct me if I made a mistake):

class ProcessPool
  def initialize pool_size
    @pool_size = pool_size
    @free_slots = @pool_size
  end

  def fork &p
    if @free_slots == 0
      Process.wait
      @free_slots += 1
    end
    @free_slots -= 1
    puts "Free slots: #{@free_slots}"
    Process.fork &p
  end

  def waitall
    Process.waitall
  end
end

pool = ProcessPool.new 8
for line in open('files.txt') do
  pool.fork { Kernel.sleep rand(10); puts line.chomp }
end
pool.waitall
puts 'finished'
淡淡の花香 2024-09-02 12:28:08

The standard library documentation for Queue has this example:

require 'thread'

queue = Queue.new

producer = Thread.new do
  5.times do |i|
    sleep rand(i) # simulate expense
    queue << i
    puts "#{i} produced"
  end
end

consumer = Thread.new do
  5.times do |i|
    value = queue.pop
    sleep rand(i/2) # simulate expense
    puts "consumed #{value}"
  end
end

consumer.join

I do find it a little verbose, though.

Wikipedia describes this as the thread pool pattern.
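
Applied to the filename list from the question, the pattern might look something like the sketch below. It uses threads rather than forked processes, and a SizedQueue so that the reader blocks instead of pulling the whole file into memory. The helper name each_line_pooled, the queue capacity, and the use of nil as a stop signal are all choices made for this example.

```ruby
require "thread"

# Hypothetical helper: feed the lines of list_path to a fixed number of
# worker threads through a bounded queue.
def each_line_pooled(list_path, workers = 8)
  queue = SizedQueue.new(workers * 2)   # the producer blocks while the queue is full
  threads = Array.new(workers) do
    Thread.new do
      # A nil entry is the stop signal for this worker.
      while (name = queue.pop)
        yield name
      end
    end
  end
  File.foreach(list_path) { |line| queue << line.chomp }
  workers.times { queue << nil }        # one stop signal per worker
  threads.each(&:join)
end
```

Each worker could in turn launch an external command for its filename, which would keep at most 8 of them running at a time.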

日裸衫吸 2024-09-02 12:28:08

arr = IO.readlines("filename")  # note: loads every line into an array in memory at once
