How do I use the Arrow/Parquet C++ libraries?

Posted on 2025-01-30 08:34:30, 325 words, 3 views, 0 comments

I need to access Parquet-formatted data on GCS. We are using the C++ libraries available for Apache Arrow and Parquet. Reading from and writing to local disk is relatively straightforward with the Parquet C++ library. However, doing the same against GCS appears to be more complicated. I've done some research into it and noticed that Arrow has a GCS filesystem class, as well as a Parquet adapter. Unfortunately, the GCS filesystem code isn't included in the version of the library that we have installed (4.0.0). I don't know whether it was an option during unpacking and installation or whether it simply wasn't available in that version. Either way, some admin work will be necessary if we are to develop this capability on our boxes. That is obviously doable. That aside, I wanted to ask whether anyone has been down this road before and could offer or suggest an approach. Thanks!

Comments (2)

青春有你 2025-02-06 08:34:30

Alright, through experimentation I was able to read a Parquet file that I stored on GCS under my service account. The first item to note is that I downloaded and built the latest version of the Arrow/Parquet libraries (version 8.0.0 as of the date of this posting). The GCS functionality is indeed new as of version 7.0.0. Note further that you have to enable the GCS functionality explicitly when building this version of the library, as it isn't part of the "default build", if you will. Finally, Arrow GCS has a few dependencies that you'll need to be aware of, namely libcurl and openssl. I built the latest-n-greatest of those too in order to make this work. You'll also need the GCS SDK. Since we already use the GCS C++ library, that was merely a matter of pointing to our existing GCP library. If you don't have the GCP C++ library, you will need to download and build that as well.

All preliminaries accounted for, here is the code (exception handling removed for clarity reasons, etc.):

#include <memory>
#include <stdexcept>
#include <string>

#include <arrow/filesystem/gcsfs.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <parquet/file_reader.h>

// Service account key in JSON form
constexpr char jsonKey[] = R"""( [snipped for brevity reasons] )""";

// Build the GCS options from the service account credentials
const arrow::fs::GcsOptions gcsOptions = arrow::fs::GcsOptions::FromServiceAccountCredentials( jsonKey );

// Create the filesystem (as built against 8.0.0; check the Make() signature in your version's gcsfs.h)
static std::shared_ptr<arrow::fs::GcsFileSystem> myGCSFileSystem = arrow::fs::GcsFileSystem::Make( gcsOptions );

// The path is given as bucket/object
const std::string path( "my_bucket/my_parquet_file.parquet" );
arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> openResult = myGCSFileSystem->OpenInputFile( path );
if( !openResult.ok() )
{
   throw std::runtime_error( "Unable to open input file on GCS server: " + openResult.status().ToString() );
}
std::shared_ptr<arrow::io::RandomAccessFile> arrowFile = openResult.ValueOrDie();

// Create a ParquetFileReader instance
std::unique_ptr<parquet::ParquetFileReader> parquet_reader = parquet::ParquetFileReader::Open( arrowFile );

From there the process of reading metadata/data from a Parquet file on a GCS server is identical to reading it on a local disk. I will therefore dispense with its inclusion.
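
One practical detail: the path passed to OpenInputFile above is written as bucket/object, with no gs:// scheme in front. If object locations in your configuration are stored as gs:// URIs, a small conversion step may be useful before the call. The following is only a sketch under that assumption; ToGcsFileSystemPath is a hypothetical helper written for illustration, not part of Arrow, and it uses only the standard library (C++17 for std::optional).

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper (not an Arrow API): accepts either "gs://bucket/object"
// or "bucket/object" and returns the "bucket/object" form used in the snippet
// above. Returns std::nullopt when the bucket or the object name is missing.
std::optional<std::string> ToGcsFileSystemPath(const std::string& location) {
    const std::string scheme = "gs://";
    std::string path = location;

    // Strip the scheme if it is present at the start of the string.
    if (path.compare(0, scheme.size(), scheme) == 0) {
        path.erase(0, scheme.size());
    }

    // Require a non-empty bucket, a '/' separator, and a non-empty object name.
    const std::size_t slash = path.find('/');
    if (slash == std::string::npos || slash == 0 || slash + 1 >= path.size()) {
        return std::nullopt;
    }
    return path;
}
```

With this helper, "gs://my_bucket/my_parquet_file.parquet" and "my_bucket/my_parquet_file.parquet" both produce my_bucket/my_parquet_file.parquet, while inputs such as "gs://my_bucket" yield no value, so the caller can report a configuration error before contacting GCS.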

Hope this helps.

别挽留 2025-02-06 08:34:30

GCS has been supported in Arrow since 7.0.0 (see the release notes here: https://arrow.apache.org/release/). I suppose good starting points would be the docs: https://arrow.apache.org/docs/cpp/api/filesystem.html#google-cloud-storage-filesystem or the examples in the tests: https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/gcsfs_test.cc
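
Since the question mentions an installed 4.0.0 and GCS support only arrived in 7.0.0, a quick check of the version string can tell you whether an upgrade is needed before anything else. The sketch below is standard-library only; MajorVersion and VersionHasGcsSupport are hypothetical helper names, and the version string is assumed to come from wherever you record it (Arrow's own build information, if I recall correctly, is exposed through arrow::GetBuildInfo() in arrow/config.h). A new enough version is necessary but not sufficient, because the library must also have been built with GCS enabled.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns the leading major number of a version string
// such as "4.0.0" or "8.0.0", or -1 if the string does not start with a number.
int MajorVersion(const std::string& version) {
    std::istringstream in(version);
    int major = -1;
    if (!(in >> major)) {
        return -1;  // extraction failed, e.g. for an empty or non-numeric string
    }
    return major;
}

// Hypothetical helper: true when the version is at least 7.0.0, the release
// in which GCS support first appeared according to the answers above.
bool VersionHasGcsSupport(const std::string& version) {
    return MajorVersion(version) >= 7;
}
```

For the versions discussed here, VersionHasGcsSupport("4.0.0") is false, while "7.0.0" and "8.0.0" both pass.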
