Parsing a bad CSV file
I've got a "bad" tab-separated file that I need to clean up. The problem is that fields may contain line breaks. I think the easiest fix is to replace the "wrong" line breaks with some replacement character, say a space. If there are supposed to be n fields on a line, I can imagine doing it like this (pseudocode):
line = read n-1 fields, each ending in a tab, then read on to the end of the line
line = line.replace("\n", " ")
line = line.replace("\r", " ")
write line to output
These files are huge, so slurping them into memory is not an option. Is this a reasonable approach? (I know it will trip over line breaks in the last field, but I'm willing to live with that.)
What would be a good way to read just enough data? I don't much care which language it's in, but I'd prefer .NET, Perl, or Python 2, since I have runtimes for those available.
Comments (3)
You can do this with a really quick awk script:
A Python solution:
I wouldn't make the file replacement automatic; make sure you keep the original.
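For illustration, here is a minimal sketch of the tab-counting idea from the question, streamed line by line so the whole file never has to sit in memory. It is only one possible way to do it, not necessarily the code this answer originally had in mind. The names `clean_records` and `clean_file`, the UTF-8 encoding, and the choice to write a separate output file are all assumptions made for the example, and it is written for Python 3 rather than Python 2.

```python
def clean_records(lines, n_fields, replacement=" "):
    """Yield logical records built from an iterable of physical lines.

    Physical lines are accumulated until at least n_fields - 1 tabs have
    been seen; the line breaks between the accumulated pieces are replaced
    with `replacement`. A line break inside the LAST field is not detected
    (the limitation already accepted in the question).
    """
    needed_tabs = n_fields - 1
    parts = []   # pieces of the record currently being assembled
    tabs = 0     # tabs seen so far in that record
    for line in lines:
        text = line.rstrip("\r\n")      # drop the physical line ending
        parts.append(text)
        tabs += text.count("\t")
        if tabs >= needed_tabs:         # record is complete
            yield replacement.join(parts)
            parts, tabs = [], 0
    if parts:
        # Trailing data with too few tabs: emit it rather than drop it.
        yield replacement.join(parts)


def clean_file(src_path, dst_path, n_fields, encoding="utf-8"):
    """Stream src_path into dst_path, one cleaned record per line.

    Hypothetical helper: the source file is left untouched.
    newline="" keeps "\r" and "\r\n" visible instead of translating them.
    """
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for record in clean_records(src, n_fields):
            dst.write(record + "\n")
```

Calling something like `clean_file("input.tsv", "cleaned.tsv", 5)` (file names and field count are placeholders) would leave the original in place. Note that a blank line between two records gets merged into the following record with a leading space, which may or may not matter for your data.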
I'm not sure if this is the right forum for this question, but you need a text editor like TextWrangler (for Mac OS X). It can handle large datasets and do some pretty sophisticated search and replace.
I guess there must be an equivalent program for the PC.
At the end of the day, CSV files are basically text files, so a text editor is what you need to take the donkey work out of the problem.