What is the most efficient way to process large amounts of on-disk data with Python?
I was writing a simple python script to read from and reconstruct data from a failed RAID5 array that I've been unable to rebuild in any other way. My script is running but slowly. My original script ran at about 80MB/min. I've since improved the script and it's running at 550MB/min but that still seems a bit low. The python script is sitting at 100% CPU, so it seems to be CPU rather than disk limited, which means I have opportunity for optimization. Because the script isn't very long at all I am unable to profile it effectively, so I don't know what's eating it all up. Here's my script as it stands right now (or at least, the important bits)
disk0chunk = disk0.read(chunkSize)
#disk1 is missing, bad firmware
disk2chunk = disk2.read(chunkSize)
disk3chunk = disk3.read(chunkSize)
if (parityDisk % 4 == 1): #if the parity stripe is on the missing drive
    output.write(disk0chunk + disk2chunk + disk3chunk)
else: #we need to rebuild the data in disk1
    # disk0num = map(ord, disk0chunk) #inefficient, old code
    # disk2num = map(ord, disk2chunk) #inefficient, old code
    # disk3num = map(ord, disk3chunk) #inefficient, old code
    disk0num = struct.unpack("16384l", disk0chunk) #more efficient new code
    disk2num = struct.unpack("16384l", disk2chunk) #more efficient new code
    disk3num = struct.unpack("16384l", disk3chunk) #more efficient new code
    magicpotato = zip(disk0num, disk2num, disk3num)
    disk1num = map(takexor, magicpotato)
    # disk1bytes = map(chr, disk1num) #inefficient, old code
    # disk1chunk = ''.join(disk1bytes) #inefficient, old code
    disk1chunk = struct.pack("16384l", *disk1num) #more efficient new code
    #output nonparity chunks based on parityDisk

def takexor(magicpotato):
    return magicpotato[0] ^ magicpotato[1] ^ magicpotato[2]
Bolding to denote the actual questions inside this giant block of text:
Is there anything I can be doing to make this faster/better? If nothing comes to mind, is there anything I can do to better research into what is making this go slowly? (Is there even a way to profile python at a per line level?) Am I even handling this the right way, or is there a better way to handle massive amounts of binary data?
The reason I ask is I have a 3TB drive rebuilding and even though it's working correctly (I can mount the image ro,loop and browse files fine) it's taking a long time. I measured it as taking until mid-January with the old code, now it's going to take until Christmas (so it's way better but it's still slower than I expected it to be.)
Before you ask, this is an mdadm RAID5 (64kb blocksize, left symmetric) but the mdadm metadata is missing somehow and mdadm does not allow you to reconfigure a RAID5 without rewriting the metadata to the disk, which I am trying to avoid at all costs, I don't want to risk screwing something up and losing data, however remote the possibility may be.
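On the per-line profiling question: the standard library's cProfile gives per-function numbers, which is often enough to see whether the time is going into read(), unpack or the XOR. A minimal sketch, where rebuild() is a hypothetical stand-in for the question's read/XOR/write loop wrapped in a function:

import cProfile
import pstats

# rebuild() is a placeholder for the loop shown above, not a real function from the script
cProfile.run("rebuild()", "rebuild.prof")
pstats.Stats("rebuild.prof").sort_stats("cumulative").print_stats(10)

For genuinely per-line timings, the third-party line_profiler package (run through kernprof) can annotate each line of a decorated function, at the cost of extra overhead while profiling.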
Comments (2)
map(takexor, magicpotato) - This is probably better done with direct iteration; map isn't efficient if it needs to call other Python code AFAIK, since it needs to construct and destroy 16384 frame objects to perform the call, etc.
Use the array module instead of struct (a sketch of both changes follows below).
If it's still too slow, compile it with Cython and add some static types (that will probably make it 2-3 orders of magnitude faster).
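A minimal sketch of the first two suggestions, assuming Python 2 as in the question and reusing the question's variable names (disk0chunk and friends are the 64kb byte strings read in the surrounding loop, so this is illustrative rather than a drop-in replacement):

import array

# Illustrative only: disk0chunk/disk2chunk/disk3chunk come from the question's read loop
disk0num = array.array("l", disk0chunk)   # reinterpret the raw bytes as machine longs
disk2num = array.array("l", disk2chunk)
disk3num = array.array("l", disk3chunk)

# direct iteration instead of map(takexor, ...) avoids one Python call per element
disk1num = array.array("l", (a ^ b ^ c for a, b, c in
                             zip(disk0num, disk2num, disk3num)))

disk1chunk = disk1num.tostring()          # use .tobytes() on Python 3

On Python 2, itertools.izip would also avoid materialising the intermediate list that zip builds for every chunk.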
Google for: widefinder python. Some of the techniques discussed in the Python entries might be of use, such as memory mapping IO.
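A minimal sketch of the memory-mapping idea, assuming the source disks are available as ordinary image files; the file name, offset and chunk size below are placeholders, not details from the question:

import mmap

# Hypothetical illustration: "disk0.img" and the offset are assumptions
disk0file = open("disk0.img", "rb")
disk0map = mmap.mmap(disk0file.fileno(), 0, access=mmap.ACCESS_READ)  # map the whole file read-only

chunkSize = 64 * 1024                      # 64kb stripes, matching the question
disk0chunk = disk0map[0:chunkSize]         # slicing the map takes the place of disk0.read(chunkSize)

Since the script is pegged at 100% CPU, memory mapping mainly helps if some of that time is actually spent copying data through read(), so it is worth profiling before and after the change.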