Is there any way to capture the input file names of multiple parquet files read in with a wildcard in pandas/awswrangler?

Asked 2025-01-17 18:35:58

This is the exact python analogue of the following Spark question:

Is there any way to capture the input file name of multiple parquet files read in with a wildcard in Spark?

I am reading in a wildcard list of parquet files using (variously) pandas and awswrangler.

Is there a way to retrieve a column containing the original filename of each row loaded into the eventual combined dataframe, exactly as per the Spark version of this question?

Update: This is possibly a way to do it - Reading DataFrames saved as parquet with pyarrow, save filenames in columns

Update2: The present question is the reverse of https://stackoverflow.com/a/59682461/1021819
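A minimal sketch of the pyarrow approach mentioned in the first update might look like the following, assuming a local directory of parquet files (the data/ path and the "filename" column name are placeholders, not taken from the linked answer):

    import pyarrow.dataset as ds
    import pandas as pd

    # treat every parquet file under the (placeholder) directory as one dataset
    dataset = ds.dataset("data/", format="parquet")

    frames = []
    # each fragment corresponds to one underlying parquet file
    for fragment in dataset.get_fragments():
        df = fragment.to_table().to_pandas()
        # record which file these rows came from
        df["filename"] = fragment.path
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)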


1 Answer

你对谁都笑 2025-01-24 18:35:58


You'll need to add the filename as a new column to each dataframe as you load it. For example, here is how to do this with a set of CSV files, since that is easy to run as a self-contained example; you'd follow the same pattern for parquet files.

from pathlib import Path

import pandas as pd

# write a couple fake csv files to our disk as examples
Path("csv1.csv").write_text("a,b\n1,2\n1,2")
Path("csv2.csv").write_text("b,c\n3,4\n3,4")

all_dfs = []

# for every csv file that we want to load
for f in Path(".").glob("csv*.csv"):
    
    # read the csv
    df = pd.read_csv(f)
    
    # add a filename column to the dataframe
    df["filename"] = f.name
    
    # store the dataframe to concat later
    all_dfs.append(df)
    
# put together the dataframes for each file
pd.concat(all_dfs)
#>      a  b  filename    c
#> 0  1.0  2  csv1.csv  NaN
#> 1  1.0  2  csv1.csv  NaN
#> 0  NaN  3  csv2.csv  4.0
#> 1  NaN  3  csv2.csv  4.0
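
For the parquet/awswrangler case in the question, a rough sketch of the same pattern might look like this, assuming the files live under an S3 prefix (the bucket and prefix below are placeholders):

    import awswrangler as wr
    import pandas as pd

    # placeholder S3 prefix containing the parquet files
    path = "s3://my-bucket/my-prefix/"

    all_dfs = []

    # list the individual objects under the prefix and keep only parquet files
    for f in wr.s3.list_objects(path):
        if not f.endswith(".parquet"):
            continue

        # read one file at a time so each row's origin is known
        df = wr.s3.read_parquet(f)

        # add the source key as a column
        df["filename"] = f

        all_dfs.append(df)

    # combine into one dataframe, as in the CSV example above
    pd.concat(all_dfs, ignore_index=True)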