Recursion, recursion, recursion --- how can I improve performance? (Recursive archive extraction in Python)

Posted 2024-09-10 21:21:13

I am trying to develop a recursive extractor. The problem is that it recurses too much (every time it finds an archive type) and performance suffers.

So how can I improve the code below?

My idea 1:

Get a dict of the directories first, together with the file types, using the file types as keys. Extract by file type; when an archive is found, extract only that one, then regenerate the archive dict.

My idea 2:

os.walk returns a generator. Is there something I can do with generators? I am new to generators.

Here is the current code:

import os, magic
m = magic.open( magic.MAGIC_NONE )
m.load()

archive_type = [ 'gzip compressed data',
        '7-zip archive data',
        'Zip archive data',
        'bzip2 compressed data',
        'tar archive',
        'POSIX tar archive',
        'POSIX tar archive (GNU)',
        'RAR archive data',
        'Microsoft Outlook email folder (>=2003)',
        'Microsoft Outlook email folder']

def extractRecursive( path ,archives):
    i=0
    for dirpath, dirnames, filenames in os.walk( path ):
        for f in filenames:
            fp = os.path.join( dirpath, f )
            i+=1
            print i
            file_type = m.file( fp ).split( "," )[0]
            if file_type in archives:
                arcExtract(fp,file_type,path,True)
                extractRecursive(path,archives)
    return "Done"



def arcExtract(file_path, file_type, extracted_path="/home/v3ss/Downloads/extracted", unlink=False):
    import subprocess, shlex

    # Outlook .pst folders need readpst; everything else goes through 7z.
    pst_types = ['Microsoft Outlook email folder (>=2003)',
                 'Microsoft Outlook email folder']

    if file_type in pst_types:
        cmd = "readpst -o '%s' -S '%s'" % (extracted_path, file_path)
    else:
        cmd = "7z x -y -r -o%s '%s'" % (extracted_path, file_path)

    print cmd
    args = shlex.split(cmd)
    print args

    ret = -1  # stays -1 if Popen itself fails
    try:
        sp = subprocess.Popen(args, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = sp.communicate()
        print out, err
        ret = sp.returncode
    except OSError, e:
        print "Error no %s  Message %s" % (e.errno, e.strerror)

    if ret == 0:
        if unlink:
            os.unlink(file_path)
        return "OK!"
    else:
        return "Failed"
if __name__ == '__main__':
    extractRecursive( 'Path/To/Archives' ,archive_type)
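On the generator question: os.walk is itself a generator, so archive discovery can be made lazy and decoupled from extraction. A minimal sketch of that idea, detecting archives by extension instead of libmagic purely to keep the example self-contained (Python 3 syntax; the original code keys on `m.file()` output instead):

```python
import os

# Extension-based stand-in for the libmagic check in the question.
ARCHIVE_EXTS = {'.zip', '.gz', '.bz2', '.7z', '.rar', '.tar', '.pst'}

def iter_archives(path):
    """Yield archive paths lazily; because os.walk is a generator,
    nothing is scanned until the caller asks for the next item."""
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in ARCHIVE_EXTS:
                yield os.path.join(dirpath, name)
```

The caller can then drive extraction one archive at a time, rather than re-walking the whole tree after every hit.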



国粹 2024-09-17 21:21:13


If, as it appears, you want to extract the archive files to paths "above" the one they're in, os.walk per se (in its normal top-down operation) can't help you, because by the time you extract an archive into a certain directory x, os.walk may well, though not necessarily, have already considered directory x -- so only by having os.walk look at the whole path over and over again can you get all contents. Beyond that, I'm surprised your code ever terminates, since the archive-type files should keep getting found and extracted -- I don't see what can ever terminate the recursion. (To solve that, it would suffice to keep a set of all the paths of archive-type files you've already extracted, to avoid considering them again when you meet them again.)

By far the best architecture, anyway, would be for arcExtract to return a list of all the files it has extracted (specifically their destination paths) -- then you could simply keep extending a list with all these extracted files during the os.walk loop (no recursion), and then keep looping just on that list (no need to keep asking the OS about files and directories, saving lots of time on that operation too) while producing a new similar list. No recursion, no redundant work. I imagine that readpst and 7z are able to supply such lists in some textual form (maybe on their standard output or error, which you currently just display but don't process) that you could parse into a list...?
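The worklist architecture described above can be sketched as follows. Here `extract_one` and `is_archive` are hypothetical callables standing in for a modified arcExtract (one that returns the paths it produced) and the libmagic check, so the sketch stays self-contained (Python 3 syntax):

```python
from collections import deque

def extract_all(initial_archives, extract_one, is_archive):
    """Iterative worklist replacing the recursion: extract_one(path)
    returns the list of files it produced; newly produced archives
    join the queue, and the seen-set prevents re-extraction loops."""
    queue = deque(initial_archives)
    seen = set(queue)
    extracted = []
    while queue:
        for produced in extract_one(queue.popleft()):
            extracted.append(produced)
            if is_archive(produced) and produced not in seen:
                seen.add(produced)
                queue.append(produced)
    return extracted
```

Each archive is visited exactly once, and the filesystem is never re-walked: the extractor's own output drives the loop.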

微暖i 2024-09-17 21:21:13


You can simplify your extractRecursive method to use os.walk as intended: os.walk already visits every subdirectory, so the inner recursion is unneeded.

Simply remove the recursive call and it should work :)

def extractRecursive(path, archives, extracted_archives=None):
    i = 0
    if not extracted_archives:
        extracted_archives = set()
    extracted_in = set()  # directories where something was extracted this pass

    for dirpath, dirnames, filenames in os.walk(path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            i += 1
            print i
            file_type = m.file(fp).split(',')[0]
            if file_type in archives and fp not in extracted_archives:
                extracted_archives.add(fp)
                extracted_in.add(dirpath)
                arcExtract(fp, file_type, path, True)

    for path in extracted_in:
        extractRecursive(path, archives, extracted_archives)

    return "Done"
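Either way, termination hinges on the seen-set: repeated full scans are safe once each archive is extracted at most once, because a pass that finds nothing new ends the loop. A sketch of that stopping condition, with `list_archives` and `extract_one` as hypothetical stand-ins for the os.walk/libmagic scan and the actual extraction (Python 3):

```python
def extract_until_stable(list_archives, extract_one):
    """Repeat full scans until a pass extracts nothing new.
    list_archives() re-scans the tree (like re-running os.walk);
    extract_one(path) extracts a single archive in place."""
    seen = set()
    while True:
        new = [p for p in list_archives() if p not in seen]
        if not new:
            return seen  # fixed point reached: every archive handled once
        for p in new:
            seen.add(p)
            extract_one(p)
```

This keeps the "walk again after extracting" behaviour of the original code but bounds the total work by the number of distinct archives.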