Parsing a malformed CSV

Posted 2024-11-25 12:42:04

I've got a "bad" tab-separated file that I need to clean up. The problem is that fields may contain line breaks. I think the easiest fix is to replace the "wrong" line breaks with a replacement character, say a space. If there are supposed to be n fields on a line, I can imagine doing it like this (pseudocode):

var line = read n-1 fields ending in a tab, and then until the end of line
line.replace("\n", " ")
line.replace("\r", " ")
write line to output

These files are huge, so slurping them into memory is not an option. Is this a reasonable approach? (I know it will trip over line breaks in the last field, but I'm willing to live with that.)
What would be a good way to read just enough data? I don't much care which language it's in, but I'd prefer .NET, Perl or Python 2, as I have runtimes for those available.
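
The pseudocode above can be sketched in Python roughly as follows. This is only a sketch under assumptions: `join_broken_records`, `filler` and the sample data are hypothetical names and values chosen for illustration, and a record is treated as complete once it holds n-1 tabs. It reads one physical line at a time, so only the current record is held in memory.

```python
def join_broken_records(lines, num_fields, filler=' '):
    """Yield one logical record per num_fields tab-separated fields.

    lines is any iterable of strings (for example an open file), so the
    input is streamed rather than loaded as a whole.
    """
    parts = []  # physical lines collected for the current record
    tabs = 0    # tabs seen so far in the current record
    for line in lines:
        # Drop the line ending, then neutralise any stray carriage return.
        line = line.rstrip('\r\n').replace('\r', filler)
        parts.append(line)
        tabs += line.count('\t')
        if tabs >= num_fields - 1:
            # Enough fields: emit the record and start a new one.
            yield filler.join(parts)
            parts, tabs = [], 0
    if parts:
        # Incomplete final record: pass it through instead of dropping it.
        yield filler.join(parts)


# Hypothetical sample: the second record is split inside its second field.
sample = ['a\tb\tc\n', 'd\te1\n', 'e2\tf\n', 'g\th\ti\n']
fixed = list(join_broken_records(sample, 3))
print(fixed)
```

As noted in the question, a line break inside the last field is not detected: the continuation is treated as the start of the next record.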

蓝海 2024-12-02 12:42:04

You can do this with a really quick awk script (set n to the expected number of fields per line):

awk -F'\t' -v n=10 '{ while (NF < n && (getline rest) > 0) $0 = $0 " " rest; print }'

沉鱼一梦 2024-12-02 12:42:04

A Python solution:

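csv_filename = 'foo.csv'
new_csv_filename = 'foo.fixed.csv'
num_fields = 10

# 'rU' is Python 2 universal-newline mode; on Python 3 use 'r' instead.
with open(csv_filename, 'rU') as reader, open(new_csv_filename, 'w') as writer:
    record = ''
    for line in reader:  # streams the file one physical line at a time
        record += line.rstrip('\n')
        if record.count('\t') >= num_fields - 1:
            writer.write(record + '\n')  # Or '\r\n' if you prefer
            record = ''
        else:
            record += ' '  # the line break was inside a field
    if record.strip():
        writer.write(record + '\n')  # incomplete record at end of file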

I wouldn't make the file replacement automatic; make sure you keep the original.
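
One way to act on that advice is to check the fixed file before the original is touched. The following is a minimal sketch, assuming that a good record has exactly `num_fields` tab-separated fields; `count_bad_records` and the throwaway demo file are hypothetical and only there for illustration.

```python
import os
import tempfile


def count_bad_records(path, num_fields):
    """Return how many lines in path lack exactly num_fields fields."""
    bad = 0
    with open(path) as handle:
        for line in handle:  # streamed, so large files are fine
            if line.rstrip('\r\n').count('\t') != num_fields - 1:
                bad += 1
    return bad


# Small self-check on a throwaway file.
fd, demo_path = tempfile.mkstemp(suffix='.tsv')
with os.fdopen(fd, 'w') as demo:
    demo.write('a\tb\tc\n')
    demo.write('d\te\n')  # one field short
bad = count_bad_records(demo_path, 3)
os.remove(demo_path)
print(bad)
```

If the count is not zero, the fixed file probably still contains broken records (for example a line break in the last field), and the original is worth keeping around.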
