Processing a large number of small JSON files with PySpark
I have around 376K JSON files under a directory in S3. The files are 2.5 KB each and contain only a single record per file. I tried to load the entire directory via Glue ETL with 20 workers, using the code below:
spark.read.json("path")
It just didn't run; there was a timeout after 5 hrs. So I wrote and ran a shell script to merge the records from these files into a single file, but when I tried to load that file, it displayed only a single record. The merged file size is 980 MB. When I tested locally by merging just 4 records into a single file, it worked fine and displayed the 4 records as expected.
I used the command below to append the JSON records from the different files into a single file:
for f in Agent/*.txt; do cat "${f}" >> merged.json; done
There is no nested JSON. I even tried the multiline option, but it didn't work. So, what can be done in this case? As I see it, after the merge the records are not being treated separately, which is causing the issue. I even tried head -n 10 to display the top 10 lines, but it went into an infinite loop.
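(For reference, the multiline attempt would have looked something like the sketch below; the exact call isn't shown in the question, and multiLine is Spark's documented option name for multi-line JSON documents:)

df = spark.read.option("multiLine", "true").json("path")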
Comments (2)
The problem was with my shell script that was being used to merge multiple small files. After the merge, the records weren't aligned properly, so they weren't treated as separate records.
Since I was dealing with a JSON dataset, I used the jq utility to process it. Below is the shell script that merged the large number of records into one file in a faster way:
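(A minimal sketch of that jq approach, since the original script isn't shown here, assuming the files live under Agent/ as in the question and that jq is installed:)

#!/bin/bash
# Re-serialise every record onto its own line (JSON Lines) and write
# them to one file. find + xargs avoids the shell argument-length limit
# that a plain glob would hit with 376K files, and batching many files
# per jq invocation is far faster than spawning one process per file.
find Agent -name '*.txt' -print0 | xargs -0 jq -c . > merged.json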
Later on, I was able to load the JSON records as expected with the code below:
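(A minimal sketch of that load, assuming the merged file is the merged.json produced above and a SparkSession named spark:)

# With one compact record per line, each line now parses as a separate record.
df = spark.read.json("merged.json")
df.count()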
I have run into trouble in the past working with thousands of small files. In my case they were CSV files, not JSON. One of the things I did to try and debug was to create a for loop, load smaller batches, and then combine all the DataFrames together. During each iteration I would call an action to force the execution. I would log the progress to get an idea of whether it was making progress, and monitor how it was slowing down as the job progressed.
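(A rough sketch of that batched-loading idea, adapted to the JSON case here; the bucket name and batch prefixes are hypothetical, and a SparkSession named spark is assumed:)

from functools import reduce

# Hypothetical prefixes that split the input into manageable chunks.
batches = [f"s3://my-bucket/Agent/batch_{i}/" for i in range(10)]

dfs = []
for i, path in enumerate(batches):
    df = spark.read.json(path)
    df.count()                    # an action, to force execution now
    print(f"batch {i} loaded")    # log progress to see where it slows down
    dfs.append(df)

# Combine all the per-batch DataFrames into one.
combined = reduce(lambda a, b: a.unionByName(b), dfs)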