Python: How to batch-read CSV files with glob and store them in a DataFrame

Posted 2025-02-12 17:49:30


I have a large number of .csv files in a folder. All .csv files have the same column names. The below code merges all the .csv files. But I have to merge the top 10 .csv files in one DataFrame after that 11 to 20 in the next step and so on... The solution 1 and solution 2 are suitable if file names are numeric but in my case file names are not following any pattern.

# Merge .csv files in one place
import glob
import os
import pandas as pd  

path = r'D:\Course\Research\Data\2017-21'  
print(path)
all_files = glob.glob(os.path.join(path, "*.csv"))
# on_bad_lines replaces the error_bad_lines flag, which was removed in pandas 2.0
df_from_each_file = (pd.read_csv(f, encoding='utf8', on_bad_lines='skip') for f in all_files)
merged_df = pd.concat(df_from_each_file)


Answers (2)

仅此而已 2025-02-19 17:49:31


Further to my comment above, here is a simpler solution.

  • All required CSV files are collected by glob. In its current state, the list is not sorted, but it can be sorted to suit your requirements
  • The list of files is iterated over in 10-file chunks
  • Each chunk is read and concatenated into the merged DataFrame: dfm
  • Do whatever you like with the DataFrame
  • The to_csv example uses a random 4-byte hex string to ensure uniqueness* over the output files

*Note: This does not guarantee uniqueness, but it sufficed for the 50 sample data files I was using.

Sample code:

import os
import pandas as pd
from glob import glob

files = glob(os.path.join('./csv2df', 'file*.csv'))  # 50 CSV files

for i in range(0, len(files), 10):
    dfm = pd.concat(pd.read_csv(f) for f in files[i:i+10])
    # Do whatever you want with the merged DataFrame.
    print(dfm.head(10), dfm.shape)
    print('\n')
    # Write to CSV?
    dfm.to_csv(f'./csv2df/merged_{os.urandom(4).hex()}.csv', index=False)

Output:

The following is a sample output from the print statements:

      col1     col2     col3     col4
0   file49   file49   file49   file49
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4
4  data4.1  data4.2  data4.3  data4.4
5  data5.1  data5.2  data5.3  data5.4
0   file30   file30   file30   file30
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4 (60, 4)
    
...
      col1     col2     col3     col4
0   file14   file14   file14   file14
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4
4  data4.1  data4.2  data4.3  data4.4
5  data5.1  data5.2  data5.3  data5.4
0   file42   file42   file42   file42
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4 (60, 4)

CSV file list:

merged_5314ad49.csv
merged_5499929e.csv
merged_5f4e306a.csv
merged_74746bd8.csv
merged_b9def1d6.csv
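If guaranteed-unique output names are preferred over random hex strings, a variation (a sketch, not part of the original answer) is to number the chunks instead. The sample CSVs and the `src` directory below are hypothetical stand-ins for your own data:

```python
import os
import tempfile
import pandas as pd
from glob import glob

# Hypothetical sample data: 25 tiny CSV files in a temporary directory.
src = tempfile.mkdtemp()
for n in range(25):
    pd.DataFrame({'col1': [n], 'col2': [n * 2]}).to_csv(
        os.path.join(src, f'file{n:02d}.csv'), index=False)

files = sorted(glob(os.path.join(src, 'file*.csv')))

for chunk_no, i in enumerate(range(0, len(files), 10)):
    dfm = pd.concat((pd.read_csv(f) for f in files[i:i + 10]), ignore_index=True)
    # The chunk index in the name makes every output file unique per run.
    dfm.to_csv(os.path.join(src, f'merged_{chunk_no:02d}.csv'), index=False)

print(sorted(os.path.basename(p) for p in glob(os.path.join(src, 'merged_*.csv'))))
# → ['merged_00.csv', 'merged_01.csv', 'merged_02.csv']
```

Because the chunk index is deterministic, re-running the script overwrites the previous outputs rather than accumulating randomly named files.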
勿忘初心 2025-02-19 17:49:31


Here's a suggestion that uses islice() from the standard library module itertools to fetch chunks of up to 10 files:

from pathlib import Path
from itertools import islice
import pandas as pd

csv_files = Path(r"D:\Course\Research\Data\2017-21").glob("*.csv")
while True:
    files = list(islice(csv_files, 10))
    if not files:
        break
    dfs = (pd.read_csv(file) for file in files)
    merged_df = pd.concat(dfs, ignore_index=True)
    # Do whatever you want to do with merged_df
    print(merged_df)

(I'm also using the standard library module pathlib because it's more convenient.)
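The islice pattern above generalizes to a small reusable helper; on Python 3.12+ the standard library ships the equivalent as itertools.batched. A minimal sketch (the file names here are made up for illustration):

```python
from itertools import islice

def batched(iterable, n):
    """Yield tuples of up to n items, like itertools.batched (Python 3.12+)."""
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch

# Example: 25 hypothetical file names split into chunks of 10.
names = [f"file{i}.csv" for i in range(25)]
chunks = list(batched(names, 10))
print([len(c) for c in chunks])  # → [10, 10, 5]
```

Each chunk could then be passed to pd.concat exactly as in the answer above.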
