Apache Spark CSV write from DataFrame with Windows newlines (CRLF)
I'm running Apache Spark 3.1.2 on a Unix-based cluster to prepare CSV files for a Windows-based ingestion system. When the Windows system ingests the CSV files created by the cluster's Spark CSV export, it fails to parse them because the newlines are Unix-style LF (`\n`), while the Windows system expects CRLF (`\r\n`) line endings.
Is there a way to configure the Apache Spark CSV exporter to write with Windows-style newlines despite running in a Unix environment? Or is there a Scala tool that can be run after the CSV write to convert the files to Windows newlines before they are exported to the Windows system?
I've seen the `.option("lineSep", "\r\n")`, but I believe that option applies to reading only.
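For context, here is the kind of Spark-native workaround I've been sketching. As far as I can tell from the docs, the CSV source's `lineSep` is capped at a single character even on the write path, so `\r\n` would be rejected there anyway; this sketch instead appends a trailing `\r` to each hand-built line and lets the text writer terminate lines with `\n`. The paths and columns are hypothetical, and it assumes no field contains commas, quotes, or embedded newlines, since it bypasses the CSV writer's quoting:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat, concat_ws, lit}

val spark = SparkSession.builder.appName("crlf-export").getOrCreate()

// Hypothetical DataFrame standing in for the real export.
val df = spark.range(3).selectExpr("id", "id * 2 AS doubled")

// Join all columns into one comma-separated string per row and append \r.
// The text writer ends every line with \n, so the files come out CRLF.
// Caveats: no quoting/escaping and no header row.
val lines = df.select(
  concat(concat_ws(",", df.columns.map(df(_)): _*), lit("\r")).as("value")
)

lines.write.text("/tmp/export-crlf") // hypothetical output path
```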
2 Answers
`sed`, `perl`, or just the `unix2dos` util.
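That can even be wired into the job itself by shelling out from the Spark driver after the write finishes. A minimal sketch, assuming `unix2dos` is installed on the driver node and using a stand-in path:

```scala
import scala.sys.process._

// Hypothetical single part file produced by a coalesce(1) CSV write.
val partFile = "/data/export/out/part-00000.csv"

// unix2dos rewrites the file in place, converting LF to CRLF.
val exitCode = Seq("unix2dos", partFile).!
require(exitCode == 0, s"unix2dos failed with exit code $exitCode")
```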
I had to post-process the file. I coalesced it to 1 partition and wrote out the CSV, then used a Java BufferedReader to load the file line by line. I then piped the input line by line into a BufferedWriter, injecting \r\n between each line... SO LAME.
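A minimal Scala sketch of that post-processing step, with hypothetical paths and assuming `coalesce(1)` left a single part file:

```scala
import java.io.{BufferedReader, BufferedWriter, FileReader, FileWriter}

// Hypothetical paths: src is the single part file from
// coalesce(1).write.csv(...); dst is the CRLF copy handed
// to the Windows system.
val src = "/data/export/out/part-00000.csv"
val dst = "/data/export/out-crlf.csv"

val reader = new BufferedReader(new FileReader(src))
val writer = new BufferedWriter(new FileWriter(dst))
try {
  // readLine() strips the original \n, so each line is
  // re-terminated with \r\n as it is copied across.
  Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach { line =>
    writer.write(line)
    writer.write("\r\n")
  }
} finally {
  reader.close()
  writer.close()
}
```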