parquet4s not returning all records
I have a simple Scala application that uses parquet4s with fs2 to read a set of partitioned records (spread across directories, generated by a Spark job).
When I run the app, it only returns a fraction of the records from the partitioned directories. There are no errors; it simply stops after a specific number of records.
The console output has no useful information:
2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] RecordReader initialized will read a total of 83 records.
2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] at row 0. reading next block
2022-04-07 18:32:29,231 |-INFO org.apache.parquet.hadoop.InternalParquetRecordReader [io-compute-4] block read in memory in 0 ms. row count = 83
An equivalent app written using pyArrow in Python is able to retrieve all records.
Any help in debugging this issue is appreciated.
Thank you.
PS - For reference, this is the sample program I use:
https://mjakubowski84.github.io/parquet4s/docs/partitioning/
Comments (1)
It looks like the names of the directories containing the partitioned data had characters that are not admissible, causing the reader to skip those directories and thus return fewer records.
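To illustrate why non-conforming directory names lead to silently missing records: partitioned datasets written by Spark use Hive-style `column=value` directory names, and a partition-discovering reader typically matches each directory against a name pattern and skips anything that does not conform. The sketch below is a hypothetical stand-in for that logic (`PartitionNameCheck` and the exact regex are assumptions for illustration, not parquet4s internals):

```scala
// Hedged sketch: Hive-style partitions are directories named "column=value".
// A reader discovering partitions may match each directory name against a
// pattern like this and silently skip non-matching directories -- which
// would show up as fewer records read, with no error.
object PartitionNameCheck {
  // Assumed rule: partition column names are word characters only.
  private val PartitionDir = "([A-Za-z_][A-Za-z0-9_]*)=(.+)".r

  // Returns the (column, value) pair, or None if the directory is skipped.
  def parsePartition(dirName: String): Option[(String, String)] =
    dirName match {
      case PartitionDir(name, value) => Some(name -> value)
      case _                         => None // skipped silently
    }
}
```

Under this model, a directory like `date=2022-04-07` is picked up, while one whose name contains characters outside the assumed pattern (or lacks the `=` separator) is dropped without any warning, which matches the symptom described in the question.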