Hashlib in Windows and Linux

Posted on 2024-10-07 03:57:00


I'm writing a p2p application in Python and am using the hashlib module to identify files with the same contents but different names within the network.

I tested the code that hashes the files on Windows (Vista) with Python 2.7, and it is very fast (less than a second for a couple of gigabytes). On Linux (Fedora 12, with Python 2.6.2 and Python 2.7.1, both compiled by myself because I couldn't find an rpm through yum) it is much slower: almost a minute for files smaller than 1 GB.

The question is: why? And can I do something to improve the performance on Linux?

The hashing code is:

import hashlib
import os
...

def crear_lista(directorio):

   lista = open(archivo, "w")

   for (root, dirs, files) in os.walk(directorio):
      for f in files:
         # file to hash
         h = open(os.path.join(root, f), "r")

         # compute the hash of the file
         md5 = hashlib.md5()

         while True:
            trozo = h.read(md5.block_size)
            if not trozo: break
            md5.update(trozo)

         # each entry is the file name, its size and its hash
         size = str(os.path.getsize(os.path.join(root, f)) / 1024)
         digest = md5.hexdigest()

         # first line: file name
         # second line: size in KB
         # third line: hash
         lines = f + "\n" + size + "\n" + digest + "\n"
         lista.write(lines)

         del md5
         h.close()

   lista.close()

I changed the open mode from r to rb and rU, but the results are the same.


夏九 2024-10-14 03:57:00


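You're reading the file in 64-byte blocks (hashlib.md5().block_size) and hashing them one at a time. At that size, a 1 GB file takes roughly 16 million read/update iterations, so the per-call overhead adds up.

You should use a much larger read size, in the range of 256 KB (262144 bytes) to 4 MB (4194304 bytes), and then hash that. The digup program, for example, reads in 1 MB blocks, i.e.: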

block_size = 1048576 # 1MB
while True:
    trozo = h.read(block_size)
    if not trozo: break
    md5.update(trozo)
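
For reference, the loop above can be wrapped in a small helper. This is only a minimal sketch: the function name md5_of_file and the chunk_size parameter are illustrative and do not come from the original post. It opens the file in binary mode, on the assumption that the hash should cover the exact bytes on disk.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    `chunk_size` (hypothetical parameter, default 1 MB) is the number of
    bytes requested per read; larger values mean fewer Python-level calls.
    """
    md5 = hashlib.md5()
    # Binary mode: hash the bytes exactly as stored, with no newline translation.
    with open(path, "rb") as h:
        # iter() with a sentinel keeps calling h.read(chunk_size)
        # until it returns an empty bytes object (end of file).
        for trozo in iter(lambda: h.read(chunk_size), b""):
            md5.update(trozo)
    return md5.hexdigest()
```

Inside crear_lista, the inner while loop could then be replaced by a call such as digest = md5_of_file(os.path.join(root, f)). The digest does not depend on the chunk size, so the value can be tuned freely within the range suggested above.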


About the author

ˉ厌
