How to recursively walk a directory and its subdirectories and record top-level directory information

Posted on 2025-02-12 22:54:02

I have the following directory structure, in a directory named Python-Pathlib-Scan-Directory:

.
├── File_Extension_Review_20220704.ipynb
├── File_Extension_Review_SIMCARE_20220704.ipynb
├── Project1
│   ├── data_1.1.csv
│   ├── data_1.2.xlsx
│   ├── data_3.1.xlsx
│   └── info.txt
├── Project2
│   ├── data_2.1.csv
│   ├── data_2.2.xlsx
│   └── resources.docx
├── Project3
│   └── Info.txt
├── data_1.csv
├── data_2.csv
├── data_3.csv
├── output.csv
├── script_1.py
└── script_2.ipynb

3 directories, 16 files

I want to count the frequency of file types (extensions) within it using collections.Counter and return the result as a pandas DataFrame by passing the counts in as a dict.

I have the following code that does this:

import collections
from pathlib import Path

import pandas as pd

dir_to_scan = Path("/Python-Pathlib-Scan-Directory")

all_files = []
# Iterate recursively using rglob(); note that '*.*' only matches names containing a dot
for i in dir_to_scan.rglob('*.*'):
    if i.is_file():
        all_files.append(i.suffix)

# Count values and return key:value pairs denoting extension and count
data = collections.Counter(all_files)
data

df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={"index": "Extension", 0: "Count"})
df

Output:

Extension   Count
.csv        6
.ipynb      3
.py         1
.txt        2
.xlsx       3
.docx       1


My issue is that this summarises across the whole directory tree, whereas I want a summary at each level (root directory, Project1 subdirectory, Project2 subdirectory, etc.). I could perhaps concat the results together in a df with an extra column specifying the directory alongside the counts, so that I can group by it later. Maybe use path.parent?

Any suggestions on the best way to approach this?

I'm also mindful that I may want to use something similar when concatenating only the files in given directories, rather than walking through everything and concatenating all files at once.

Comments (1)

半夏半凉 2025-02-19 22:54:02

Using the Python standard library pathlib module and a recursive function, here is one way to do it:

from pathlib import Path

def scan(target, results=None):
    """Helper function that scans a directory
    and its sub-directories for file extensions.

    Args:
        target: target directory.
        results: dictionary to collect results. Defaults to None.

    Returns:
        dictionary whose keys are the scanned directories
        and whose values are lists of the collected extensions.

    """
    if results is None:
        results = {}
    target = Path(target)
    results[str(target)] = []
    for item in target.glob("*"):
        if item.is_dir():
            # Recurse into sub-directories, sharing the same results dict
            scan(item, results)
        elif item.is_file():
            # Files without a suffix are recorded as "no_ext"
            results[str(target)].append(item.suffix or "no_ext")
    return results

And so, given a fake directory which contains several sub-directories and files with and without extensions:

from collections import Counter

import pandas as pd

results = scan(r"C:\fake_dir")

# Count values and instantiate dataframe
df = pd.DataFrame(
    [dict(Counter(value)) for value in results.values()], index=results.keys()
).fillna(0)

# Sort columns ("no_ext" meaning "Files without extension" appears last)
df = df.reindex(columns=sorted(df.columns))
print(df)
# Output
                                          .docx  .ini  .jpeg  .jpg  .pdf  \
C:\fake_dir                                 1.0   1.0    0.0   0.0   1.0   
C:\fake_dir\fake_data                       0.0   0.0    0.0   4.0   0.0   
C:\fake_dir\fake_data\empty_dir             0.0   0.0    0.0   0.0   0.0   
C:\fake_dir\fake_data\source_dir            0.0   0.0    1.0   2.0   0.0   
C:\fake_dir\fake_data\source_dir\sub_dir    0.0   0.0    0.0   0.0   0.0   

                                          .png  .raw  .tif  no_ext  
C:\fake_dir                                0.0   0.0   0.0     0.0  
C:\fake_dir\fake_data                      0.0   0.0   0.0     1.0  
C:\fake_dir\fake_data\empty_dir            0.0   0.0   0.0     0.0  
C:\fake_dir\fake_data\source_dir           1.0   0.0   2.0     0.0  
C:\fake_dir\fake_data\source_dir\sub_dir   0.0   1.0   0.0     1.0
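
Since the question asks for an extra column naming the directory and mentions path.parent, a long-format variant may also be useful. The following is only a sketch under stated assumptions, not part of the answer above: the function name count_extensions_by_directory and the column names Directory, Extension and Count are hypothetical, and directories are reported relative to the scanned root (with "." standing for the root itself).

```python
from collections import Counter
from pathlib import Path

import pandas as pd


def count_extensions_by_directory(root):
    """Return one row per (directory, extension) pair found under root.

    Hypothetical helper: the name and the column labels are illustrative.
    """
    root = Path(root)
    # Key each file by its parent directory (relative to root) and its suffix;
    # files without a suffix are labelled "no_ext", as in the answer above.
    counts = Counter(
        (str(path.parent.relative_to(root)), path.suffix or "no_ext")
        for path in root.rglob("*")
        if path.is_file()
    )
    rows = [
        {"Directory": directory, "Extension": extension, "Count": count}
        for (directory, extension), count in counts.items()
    ]
    # Passing columns explicitly keeps the frame well-formed even when no files are found
    df = pd.DataFrame(rows, columns=["Directory", "Extension", "Count"])
    return df.sort_values(["Directory", "Extension"], ignore_index=True)
```

Calling count_extensions_by_directory(dir_to_scan) should give a frame that can be grouped by Directory later. If the wide layout shown above is preferred, df.pivot_table(index="Directory", columns="Extension", values="Count", fill_value=0) reshapes it. Unlike scan(), this sketch lists only directories that directly contain at least one file.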