Reading partitioned parquet records in PySpark
I have a parquet file partitioned by a date field (YYYY-MM-DD).
How can I efficiently read the (current date - 1 day) records from the file in PySpark? Please suggest.
PS: I would not like to read the entire file and then filter the records as the data volume is huge.
2 Answers
There are multiple ways to go about this:
Suppose this is the input data and you write out the DataFrame partitioned on the "date" column:
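A minimal sketch of that write (the sample rows, column names, and the /tmp/events path are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input data with a string "date" column (yyyy-MM-dd)
df = spark.createDataFrame(
    [("2023-01-14", 1, "a"), ("2023-01-15", 2, "b"), ("2023-01-16", 3, "c")],
    ["date", "id", "value"],
)

# Write the DataFrame partitioned by "date"; this creates one
# sub-directory per value, e.g. /tmp/events/date=2023-01-15/
df.write.partitionBy("date").parquet("/tmp/events")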
You can read the parquet files associated with a given date using this syntax:
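Continuing the sketch above, pointing the reader at the partition directory for yesterday's date means only that directory is scanned; note that the partition column itself is not part of the output when you read the partition path directly:

from datetime import date, timedelta

yesterday = (date.today() - timedelta(days=1)).strftime("%Y-%m-%d")

# Read only the files under the partition directory for that date
df_day = spark.read.parquet(f"/tmp/events/date={yesterday}")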
A more efficient solution is to use Delta tables:
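A sketch of writing the same data as a Delta table (this assumes the delta-spark package is installed and the Spark session is configured with the Delta Lake extensions; the /tmp/events_delta path is made up):

# Write the DataFrame as a partitioned Delta table
df.write.format("delta").partitionBy("date").save("/tmp/events_delta")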
The Spark engine uses the _delta_log to optimize your query and reads only the parquet files that apply to it. Also, the output will keep all the columns:
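For example, reading the hypothetical Delta table above with a filter on the partition column lets Spark prune partitions via the _delta_log, and the "date" column stays in the result:

# Only the files for yesterday's partition are scanned
df_day = (
    spark.read.format("delta")
    .load("/tmp/events_delta")
    .filter(f"date = '{yesterday}'")   # yesterday computed as above
)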
You can read it by passing a date variable while reading.
This is dynamic code: you don't need to hardcode the date, just append it to the path.
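A self-contained sketch of that dynamic approach (the base path /data/events is hypothetical):

from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build yesterday's date at runtime instead of hardcoding it
yesterday = (date.today() - timedelta(days=1)).strftime("%Y-%m-%d")

# Append the partition value to the base path and read just that partition
path = f"/data/events/date={yesterday}"
df = spark.read.parquet(path)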