Processing a large number of small JSON files with PySpark

Posted on 2025-01-24 04:40:46


I have around 376K JSON files under a directory in S3. Each file is about 2.5 KB and contains only a single record. When I tried to load the entire directory with the code below in a Glue ETL job with 20 workers:

spark.read.json("path")  

It just didn't run; there was a timeout after 5 hours. So I wrote and ran a shell script to merge the records from these files into a single file, but when I tried to load the merged file, Spark displayed only a single record. The merged file is 980 MB. It worked fine when I tested locally with 4 records merged into a single file: it displayed the 4 records as expected.

I used the command below to append the JSON records from the different files into a single file:

for f in Agent/*.txt; do cat "${f}" >> merged.json; done

The data doesn't have any nested JSON. I even tried the multiline option, but it didn't work. So what can be done in this case? My guess is that after the merge the records are not being treated separately, which is causing the issue. I even tried head -n 10 to display the top 10 lines, but it appears to run forever.
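The symptoms above are consistent with how spark.read.json behaves: by default it expects JSON Lines (exactly one JSON document per line), while the multiline option makes it parse each file as a single JSON value. A minimal PySpark sketch contrasting the two modes; the S3 path is a placeholder, not from the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default mode expects JSON Lines: one JSON document per line. If the merged
# file has no newlines between records, Spark sees a single (often corrupt)
# record, which matches the behaviour described above.
df_default = spark.read.json("s3://my-bucket/Agent/merged.json")

# multiline=true parses each file as one JSON value, e.g. a JSON array of
# records such as the jq output shown in the answers below.
df_multiline = (spark.read
                .option("multiline", "true")
                .json("s3://my-bucket/Agent/merged.json"))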


Comments (2)

瞳孔里扚悲伤 2025-01-31 04:40:46


The problem was with the shell script I was using to merge the many small files. After the merge, the records weren't aligned properly, so they weren't treated as separate records.

Since I was dealing with a JSON dataset, I used the jq utility to process it. Below is the shell command that merges a large number of records into one file much faster:

find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt  

Later on, I was able to load the JSON records as expected with the code below:

spark.read.option("multiline","true").json("path")
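A caveat worth noting (an addition, not part of the original answer): jq -s slurps the entire input stream into memory before emitting the array, and a single multiline JSON file is generally parsed by a single Spark task, so the 980 MB result is not split across the workers at read time. A hedged sketch of repartitioning after the read so the rest of the job runs in parallel; the path and partition count are placeholders:

# Sketch only; the path and partition count are placeholders.
df = (spark.read
      .option("multiline", "true")
      .json("s3://my-bucket/merged/output.txt"))

# A single multiline file arrives as very few partitions; repartition so
# downstream stages can use the whole cluster.
df = df.repartition(64)
print(df.count())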
甜中书 2025-01-31 04:40:46


I have run into trouble in the past working with thousands of small files. In my case they were CSV files, not JSON. One of the things I did to try and debug was to create a for loop that loaded smaller batches and then combined all the data frames together. During each iteration I would call an action to force execution. I would log the progress to get an idea of whether it was moving forward, and monitor how it slowed down as the job progressed.
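A minimal PySpark sketch of the batching approach described above; the batch prefixes and counts are hypothetical, not from the original answer:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical layout: the small files have been grouped under
# batch_000/ ... batch_009/ prefixes beforehand.
batch_paths = [f"s3://my-bucket/agent/batch_{i:03d}/" for i in range(10)]

frames = []
for i, path in enumerate(batch_paths):
    df = spark.read.json(path)
    # An action per iteration forces execution, so slowdowns show up early.
    print(f"batch {i}: {df.count()} records loaded")
    frames.append(df)

# Combine all the batches into a single DataFrame.
merged = reduce(lambda a, b: a.unionByName(b), frames)
print(f"total: {merged.count()} records")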
