I am trying to use awswrangler to read into a pandas dataframe an arbitrarily-large parquet file stored in S3, but limiting my query to the first N
rows due to the file's size (and my poor bandwidth).
I cannot see how to do it, or whether it is even possible without relocating.
Could I use chunked=INTEGER
and abort after reading the first chunk, say, and if so how?
I have come across this incomplete solution (last N rows ;) ) using pyarrow - Read last N rows of S3 parquet table - but a time-based filter would not be ideal for me and the accepted solution doesn't even get to the end of the story (helpful as it is).
Or is there another way without first downloading the file (which I could probably have done by now)?
Thanks!
2 Answers
You can do that with awswrangler using S3 Select. For example:
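A minimal sketch of such a call (the S3 path below is a placeholder; the row cap comes from the LIMIT clause in the SQL):

```python
import awswrangler as wr

# S3 Select runs the SQL on the server side, so only the matching rows
# are transferred; the LIMIT clause caps the result at 5 rows.
df = wr.s3.select_query(
    sql="SELECT * FROM s3object s LIMIT 5",
    path="s3://my-bucket/path/to/file.snappy.parquet",  # placeholder path
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)
```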
This would return only 5 rows from the S3 object.
This is not possible with other read methods, because the entire object must be pulled locally before it can be read. With S3 Select, the filtering is done on the server side instead.
Just in case you encounter the OverMaxParquetBlockSize error with S3 Select:
"Could I use chunked=INTEGER and abort after reading the first chunk, say, and if so how?"
Yes, you can use 'return' or 'raise StopIteration' (if using a pre-3.7 version of Python) to terminate the generator.
For example:
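A minimal sketch of that approach (the path, helper name and chunk size below are placeholders for illustration):

```python
import awswrangler as wr

def first_chunk(path: str, n: int):
    # chunked=n makes read_parquet return a generator that yields
    # DataFrames of up to n rows each instead of one big DataFrame
    for chunk in wr.s3.read_parquet(path=path, chunked=n):
        # returning after the first chunk terminates the loop; the
        # abandoned generator is never advanced again
        return chunk

# placeholder path and chunk size
df = first_chunk("s3://my-bucket/big-file.parquet", 1000)
```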