Parsing broken CSV

Published 2024-11-25 12:42:04


I've got a "bad" tab-separated file that I need to clean up. The problem is that fields may contain linebreaks. I think the easiest way to fix this is to replace the 'wrong' linebreaks with some replacement character, say a space. If there are supposed to be n fields on a line, I can imagine doing it like this (pseudocode):

var line = read n-1 fields ending in a tab, and then until the end of line
line.replace("\n", " ")
line.replace("\r", " ")
write line to output

Now these files are huge, and slurping them is not an option. Is this a reasonable approach? (I know this will trip over linebreaks in the last field, but I'm willing to live with that.)
What would be a good way to read enough data? I don't care much which language it's in, but I'd prefer .NET, Perl, or Python 2, as I have runtimes for those available.
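For what it's worth, the pseudocode above can be turned into a streaming sketch along these lines (Python; the function name and the n−1-tab completeness check are my own, and as noted a linebreak in the last field will still fool it):

```python
def fix_records(lines, num_fields):
    """Stream raw lines and re-join records whose fields contain
    stray linebreaks, replacing each bogus linebreak with a space.

    A record is considered complete once it contains at least
    num_fields - 1 tabs, so memory use stays at one record at a time.
    """
    pending = ''
    for raw in lines:
        pending += raw.rstrip('\r\n')
        if pending.count('\t') >= num_fields - 1:
            yield pending
            pending = ''
        else:
            pending += ' '  # stand-in for the removed linebreak
    if pending:
        yield pending  # trailing partial record, if any


# Example: the second field of record 1 was split across two lines.
broken = ['a\tb1\n', 'b2\tc\n', 'd\te\tf\n']
print(list(fix_records(broken, 3)))  # ['a\tb1 b2\tc', 'd\te\tf']
```

Because it iterates over the input line by line, this works the same on a file object as on a list, so the huge files never need to fit in memory.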


Comments (3)

蓝海 2024-12-02 12:42:04


You can do this in a really quick awk script:

awk -F'\t' '{ while (NF < numberoffields) { line = $0; getline; $0 = line " " $0 } print }'
沉鱼一梦 2024-12-02 12:42:04


A Python solution:

csv_filename = 'foo.csv'
new_csv_filename = 'foo.fixed.csv'
num_fields = 10

with open(csv_filename, 'rU') as reader, open(new_csv_filename, 'w') as writer:
    while True:
        line = ''
        while len(line.split('\t')) < num_fields:
            chunk = reader.readline()
            if not chunk:  # end of file: stop instead of looping forever
                break
            line += chunk.replace('\n', ' ')
        if not line:
            break
        writer.write(line.rstrip() + '\n')  # Or '\r\n' if you prefer

I wouldn't make the file replacement automatic; make sure you keep the original.
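As a quick sanity check, here is the same loop run against a tiny in-memory sample instead of real files (the sample data is mine; the second field of the first record contains a stray newline):

```python
import io

num_fields = 3
reader = io.StringIO('a\tb1\nb2\tc\nd\te\tf\n')

fixed = []
while True:
    line = ''
    while len(line.split('\t')) < num_fields:
        chunk = reader.readline()
        if not chunk:  # end of input
            break
        line += chunk.replace('\n', ' ')
    if not line:
        break
    fixed.append(line.rstrip())

print(fixed)  # ['a\tb1 b2\tc', 'd\te\tf']
```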

不疑不惑不回忆 2024-12-02 12:42:04


I'm not sure if this is the right forum to ask this question, but what you need is a text editor like TextWrangler (for Mac OS X). It can handle large datasets and do some pretty sophisticated search and replace.

I imagine there must be an equivalent program for the PC.

At the end of the day, CSV files are basically text files, so a good text editor is what will take the donkey work out of the problem.
