How do I use pyarrow.csv.read_csv to read a file from a filesystem?

Posted on 2025-01-15 05:03:21

I want to read a single CSV file in a google bucket with pyarrow. How do I do this?

I can create a FileSystem object with gcsfs, but I don't see a way to provide this to pyarrow.csv.read_csv.

Do I need to create some sort of file stream from the file system? What's the best way to do this?

import gcsfs
import pyarrow.csv as csv

fs = gcsfs.GCSFileSystem(project='foo')

csv.read_csv("bucket/foo/bar.csv", filesystem=fs)

TypeError: read_csv() got an unexpected keyword argument 'filesystem'

Using pyarrow version 6.0.1

恋竹姑娘 2025-01-22 05:03:21

I'm guessing you are working from this doc. You're correct that the approach listed there does not work with read_csv, because read_csv has no filesystem parameter. We can still generally do this, but the process is a bit different.

Pyarrow has its own filesystem abstraction. If you have a pyarrow filesystem then you can first open a file and then use that file to read the CSV:

import pyarrow.csv as csv
import pyarrow.fs as fs

local_fs = fs.LocalFileSystem()
# Open the file through the pyarrow filesystem, then pass the open file to read_csv.
with local_fs.open_input_file('foo/bar.csv') as csv_file:
    table = csv.read_csv(csv_file)

Unfortunately, a gcsfs.GCSFileSystem is not a "pyarrow filesystem" but you have a few options.

The method gcsfs.GCSFileSystem.open can give you a "python file object" which you can use as input to pyarrow.csv.read_csv.

import gcsfs
import pyarrow.csv as csv

fs = gcsfs.GCSFileSystem(project='foo')
# Open the object in binary mode and pass the file object itself, not the path.
with fs.open("bucket/foo/bar.csv", 'rb') as csv_file:
    table = csv.read_csv(csv_file)
