How do I read only the first N rows of a parquet file stored in S3 using awswrangler?

Posted 2025-02-01 04:42:35


I am trying to use awswrangler to read into a pandas dataframe an arbitrarily-large parquet file stored in S3, but limiting my query to the first N rows due to the file's size (and my poor bandwidth).

I cannot see how to do it, or whether it is even possible without relocating.

Could I use chunked=INTEGER and abort after reading the first chunk, say, and if so how?

I have come across this incomplete solution (last N rows ;) ) using pyarrow - Read last N rows of S3 parquet table - but a time-based filter would not be ideal for me and the accepted solution doesn't even get to the end of the story (helpful as it is).

Or is there another way without first downloading the file (which I could probably have done by now)?

Thanks!

2 Answers

烟─花易冷 2025-02-08 04:42:35


You can do that with awswrangler using S3 Select. For example:

import awswrangler as wr

# The LIMIT clause is applied by S3 Select on the server side, so only 5 rows are transferred.
df = wr.s3.select_query(
    sql="SELECT * FROM s3object s limit 5",
    path="s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)

would return 5 rows only from the S3 object.

This is not possible with the other read methods, because the entire object must be pulled locally before it can be read. With S3 Select, the filtering is done on the server side instead.
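If you want to reuse this for other files or row counts, the same call can be wrapped in a small helper. This is only a sketch: the function name head_parquet_s3 and the example path are made up for illustration and are not part of awswrangler.

import awswrangler as wr
import pandas as pd

def head_parquet_s3(path: str, n: int = 5) -> pd.DataFrame:
    # Ask S3 Select for the first n records; the LIMIT is applied server-side.
    return wr.s3.select_query(
        sql=f"SELECT * FROM s3object s LIMIT {int(n)}",
        path=path,
        input_serialization="Parquet",
        input_serialization_params={},
        use_threads=True,
    )

# Hypothetical object; substitute your own bucket and key.
df = head_parquet_s3("s3://my-bucket/data/part-00000.snappy.parquet", n=10)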

旧时模样 2025-02-08 04:42:35


Just in case you encounter the OverMaxParquetBlockSize error with S3 Select:

Could I use chunked=INTEGER and abort after reading the first chunk, say, and if so how?

Yes, you can use return (or raise StopIteration, if you are on a Python version before 3.7) to terminate the generator.

For example:

    import awswrangler as wr

    # Wrapped in a generator function (the name is illustrative) so the yield/return below are valid.
    def first_chunks(file_path, boto3_session=None):
        # chunked=5 makes read_parquet yield DataFrames of at most 5 rows each.
        dataframes = wr.s3.read_parquet(
            path=file_path, chunked=5, boto3_session=boto3_session
        )
        chunk_counter = 0
        for dataframe in dataframes:
            if chunk_counter == 1:
                return  # Terminate the generator after the first chunk has been yielded
            chunk_counter = chunk_counter + 1
            yield dataframe
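As a usage sketch (the generator name first_chunks comes from the wrapper above; the path is a placeholder), the yielded chunks can be collected into a single DataFrame:

    import pandas as pd

    # pd.concat accepts the generator directly; here it yields a single 5-row chunk.
    first_rows = pd.concat(first_chunks("s3://my-bucket/data/part-00000.snappy.parquet"))
    print(first_rows.head())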