Pattern for multiple worker scripts processing a list of data
Say I have a list of 10,000 lines of string that needs to be processed by 100 worker scripts.
I would like as many of the 100 scripts as possible to run concurrently.
Once a worker script is finished with one line, it should process the next available line that is not currently being processed by another worker script.
If a worker script fails on a line, it will skip it and move on to the next available line that is not currently being processed by another worker script.
A worker script may become unavailable at any time, for an unknown amount of time.
Now assume that out of the initial 100 worker scripts, any given worker script may become unavailable (either crashing or taking too long on its current data) but will become available again after some time down the road. It may become unavailable again, and may take so long to recover that it never returns within the time it takes to process the 10,000 lines.
How can all 10,000 lines be processed with the initial 100 worker scripts running concurrently, when any of them may become unavailable and then, after some unknown random time, become available again and ready to process?
I imagine something like a loop over all 10,000 lines, plus another script that polls all available workers at intervals and launches those workers concurrently.
I am uncertain how I would approach this problem.
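The polling-dispatcher idea described above could be sketched roughly as follows. This is a minimal, single-process illustration only: the worker ids, `poll_available`, and the instant "work finishing" step are all stand-ins for real worker scripts and a real liveness check.

```python
# Hedged sketch of a coordinator that polls for available workers and
# hands each one the next unprocessed line. All names are illustrative.
lines = list(range(20))                 # stand-in for the 10,000 lines
pending = list(lines)                   # lines not yet handed out
workers = {w: None for w in range(5)}   # worker id -> line in progress (None = idle)

def poll_available(workers):
    # In a real system this would check heartbeats / process liveness.
    return [w for w, busy in workers.items() if busy is None]

processed = []
while pending or any(busy is not None for busy in workers.values()):
    # Assign the next available line to each idle worker.
    for w in poll_available(workers):
        if pending:
            workers[w] = pending.pop(0)
    # Simulate the work completing; real code would poll worker results.
    for w, line in list(workers.items()):
        if line is not None:
            processed.append(line)
            workers[w] = None

print(len(processed))  # 20
```

A real version would also need a timeout per assignment, so that a line handed to a worker that never returns can be reclaimed and given to another worker.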
The producer/consumer pattern is pretty helpful for situations like this. I explained it a bit more over here.
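As a rough illustration of the pattern, a shared queue can hold the lines and each worker repeatedly pulls the next unclaimed line, skipping lines that fail. Here threads stand in for the worker scripts, and `process_line` is a placeholder for the real per-line work; the counts are scaled down to keep the demo quick.

```python
# Minimal producer/consumer sketch. Assumptions: process_line is a
# placeholder, and thread workers stand in for separate worker scripts.
import queue
import threading

NUM_WORKERS = 4                              # 100 in the question
lines = [f"line-{i}" for i in range(100)]    # stand-in for the 10,000 lines

work = queue.Queue()
for line in lines:
    work.put(line)

results = []
results_lock = threading.Lock()

def process_line(line):
    # Placeholder for the real work; may raise to simulate a failure.
    return line.upper()

def worker():
    while True:
        try:
            line = work.get_nowait()  # claim the next available line
        except queue.Empty:
            return                    # no work left, worker exits
        try:
            result = process_line(line)
        except Exception:
            pass                      # failed: skip this line, move on
        else:
            with results_lock:
                results.append(result)
        finally:
            work.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 100
```

Because the queue is the single source of truth for unclaimed lines, a worker that stalls simply stops pulling from it, and rejoins naturally when it resumes; no line is ever handed to two workers at once.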
That said, if your situation is really that straightforward, simpler techniques may be more appropriate, like partitioning the data evenly.
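Even partitioning could look like the sketch below: split the list into one contiguous chunk per worker up front, with no coordination needed afterwards (`partition` is an illustrative helper, not a library function).

```python
# Split n items into k chunks whose sizes differ by at most one.
def partition(items, k):
    base, extra = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

chunks = partition(list(range(10000)), 100)
print(len(chunks), len(chunks[0]))  # 100 100
```

The trade-off is the one the question anticipates: with static partitioning, a worker that goes away takes its whole chunk with it, so this only suits the case where workers are reliable.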
Also, I assume that you're not expecting to see a 100x speedup, as your hardware surely wouldn't support that...
Of course if I've completely misunderstood and you actually want to process each string 100x (i.e. each script does something different), then please clarify.