Python: How to batch-read CSV files with glob and store them in a DataFrame

Posted 2025-02-12 17:49:30


I have a large number of .csv files in a folder. All .csv files have the same column names. The below code merges all the .csv files. But I have to merge the top 10 .csv files in one DataFrame after that 11 to 20 in the next step and so on... The solution 1 and solution 2 are suitable if file names are numeric but in my case file names are not following any pattern.

# Merge .csv files in one place
import glob
import os
import pandas as pd  

path = r'D:\Course\Research\Data\2017-21'  
print(path)
all_files = glob.glob(os.path.join(path, "*.csv"))
# on_bad_lines replaces the error_bad_lines flag, which was removed in pandas 2.0
df_from_each_file = (pd.read_csv(f, encoding='utf8', on_bad_lines='skip') for f in all_files)
merged_df = pd.concat(df_from_each_file)


Answers (2)

仅此而已 2025-02-19 17:49:31


Further to my comment above, here is a simpler solution.

  • All required CSV files are collected by glob. In its current state, the list is not sorted, but it can be sorted to suit your requirements
  • The list of files is iterated over in 10-file chunks
  • Each chunk is read and concatenated into the merged DataFrame: dfm
  • Do whatever you like with the DataFrame
  • The to_csv example uses a random 4-byte hex string to ensure uniqueness* over the output files

*Note: This does not guarantee uniqueness, but it sufficed for the 50 sample data files I was using.

Sample code:

import os
import pandas as pd
from glob import glob

files = glob(os.path.join('./csv2df', 'file*.csv'))  # 50 CSV files

for i in range(0, len(files), 10):
    dfm = pd.concat(pd.read_csv(f) for f in files[i:i+10])
    # Do whatever you want with the merged DataFrame.
    print(dfm.head(10), dfm.shape)
    print('\n')
    # Write to CSV?
    dfm.to_csv(f'./csv2df/merged_{os.urandom(4).hex()}.csv', index=False)

Output:

The following is a sample output from the print statements:

      col1     col2     col3     col4
0   file49   file49   file49   file49
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4
4  data4.1  data4.2  data4.3  data4.4
5  data5.1  data5.2  data5.3  data5.4
0   file30   file30   file30   file30
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4 (60, 4)
    
...
      col1     col2     col3     col4
0   file14   file14   file14   file14
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4
4  data4.1  data4.2  data4.3  data4.4
5  data5.1  data5.2  data5.3  data5.4
0   file42   file42   file42   file42
1  data1.1  data1.2  data1.3  data1.4
2  data2.1  data2.2  data2.3  data2.4
3  data3.1  data3.2  data3.3  data3.4 (60, 4)

CSV file list:

merged_5314ad49.csv
merged_5499929e.csv
merged_5f4e306a.csv
merged_74746bd8.csv
merged_b9def1d6.csv
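If guaranteed-unique output names are preferred over random hex strings, a variation (a sketch, not part of the original answer) is to number the chunks instead. The sample CSVs and the `src` directory below are hypothetical stand-ins for your own data:

```python
import os
import tempfile
import pandas as pd
from glob import glob

# Hypothetical sample data: 25 tiny CSV files in a temporary directory.
src = tempfile.mkdtemp()
for n in range(25):
    pd.DataFrame({'col1': [n], 'col2': [n * 2]}).to_csv(
        os.path.join(src, f'file{n:02d}.csv'), index=False)

files = sorted(glob(os.path.join(src, 'file*.csv')))

for chunk_no, i in enumerate(range(0, len(files), 10)):
    dfm = pd.concat((pd.read_csv(f) for f in files[i:i + 10]), ignore_index=True)
    # The chunk index in the name makes every output file unique per run.
    dfm.to_csv(os.path.join(src, f'merged_{chunk_no:02d}.csv'), index=False)

print(sorted(os.path.basename(p) for p in glob(os.path.join(src, 'merged_*.csv'))))
# → ['merged_00.csv', 'merged_01.csv', 'merged_02.csv']
```

Because the chunk index is deterministic, re-running the script overwrites the previous outputs rather than accumulating randomly named files.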
勿忘初心 2025-02-19 17:49:31


Here's a suggestion that uses islice() from the standard library module itertools to fetch chunks of up to 10 files:

from pathlib import Path
from itertools import islice
import pandas as pd

csv_files = Path(r"D:\Course\Research\Data\2017-21").glob("*.csv")
while True:
    files = list(islice(csv_files, 10))
    if not files:
        break
    dfs = (pd.read_csv(file) for file in files)
    merged_df = pd.concat(dfs, ignore_index=True)
    # Do whatever you want to do with merged_df
    print(merged_df)

(I'm also using the standard library module pathlib because it's more convenient.)
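The islice pattern above generalizes to a small reusable helper; on Python 3.12+ the standard library ships the equivalent as itertools.batched. A minimal sketch (the file names here are made up for illustration):

```python
from itertools import islice

def batched(iterable, n):
    """Yield tuples of up to n items, like itertools.batched (Python 3.12+)."""
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch

# Example: 25 hypothetical file names split into chunks of 10.
names = [f"file{i}.csv" for i in range(25)]
chunks = list(batched(names, 10))
print([len(c) for c in chunks])  # → [10, 10, 5]
```

Each chunk could then be passed to pd.concat exactly as in the answer above.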
