Ruby: How do I process a CSV file with "bad commas"?

Posted on 2024-09-27 15:47:06

I need to process a CSV file from FedEx.com containing shipping history. Unfortunately, FedEx doesn't seem to actually test its CSV files, as it doesn't quote strings that have commas in them.

For instance, a company name might be "Dog Widgets, Inc." but the CSV doesn't quote that string, so any CSV parser thinks that comma before "Inc." is the start of a new field.

Is there any way I can reliably parse those rows using Ruby?

The only differentiating characteristic I can find is that commas that are part of a string have a space after them, while commas that separate fields have no space. No clue how that helps me parse this, but it is something I noticed.

Comments (4)

玉环 2024-10-04 15:47:06

You can use a negative lookahead, so that the split happens only on commas that are not followed by a space or tab:

>> "foo,bar,baz,pop, blah,foobar".split(/,(?![ \t])/)
=> ["foo", "bar", "baz", "pop, blah", "foobar"]
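
Applied to a whole file, the same split might look like the sketch below. The helper name split_fedex_line and the sample rows are made up for illustration. The code assumes that no field is quoted and that field-separating commas are never followed by a space.

```ruby
# Split on commas that are NOT followed by a space or tab.
FIELD_SEPARATOR = /,(?![ \t])/

# Hypothetical helper: split one raw line into fields.
# The -1 limit keeps trailing empty fields, which String#split drops by default.
def split_fedex_line(line)
  line.chomp.split(FIELD_SEPARATOR, -1)
end

# Made-up sample data standing in for the real export.
sample = <<~DATA
  123,Dog Widgets, Inc.,Memphis,TN
  456,Acme,Austin,
DATA

rows = sample.each_line.map { |line| split_fedex_line(line) }
rows.each { |fields| p fields }
```

Comparing each row's length against the expected column count is a cheap way to spot lines where the "comma plus space" assumption did not hold.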

月亮是我掰弯的 2024-10-04 15:47:06

Well, here's an idea: You could replace each instance of comma-followed-by-a-space with a unique character, then parse the CSV as usual, then go through the resulting rows and reverse the replace.
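
A minimal sketch of that idea using Ruby's csv library, one line at a time. The placeholder character and the helper name parse_line_with_embedded_commas are assumptions for illustration. It only works if the placeholder never occurs in the real data.

```ruby
require "csv"

# Assumption: this control character never appears in the real data.
PLACEHOLDER = "\u0001"

# Hypothetical helper: hide ", " from the parser, parse, then restore it.
def parse_line_with_embedded_commas(line)
  masked = line.chomp.gsub(", ", PLACEHOLDER)
  fields = CSV.parse_line(masked) || []   # parse_line returns nil for an empty line
  # Empty CSV fields come back as nil, hence the safe navigation operator.
  fields.map { |field| field&.gsub(PLACEHOLDER, ", ") }
end

p parse_line_with_embedded_commas("foo,bar,baz,pop, blah,foobar")
```

Going through the CSV parser rather than a plain split means any fields that are properly quoted elsewhere in the file are still handled.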

羁拥 2024-10-04 15:47:06

Perhaps something along these lines..

Use gsub to change the ', ' to something else before splitting:

ruby-1.9.2-p0 > "foo,bar,baz,pop, blah,foobar".gsub(/,\ /,'| ').split(',')
[
    [0] "foo",
    [1] "bar",
    [2] "baz",
    [3] "pop| blah",
    [4] "foobar"
]

and then turn each | back into a comma afterwards.

め七分饶幸 2024-10-04 15:47:06

If you are lucky enough to have only one field like that, you can parse the leading fields off the start and the trailing fields off the end, and assume whatever is left is the offending field. In Python (no habla Ruby) this would look something like:

fields = line.split(',') # doesn't work if some fields are quoted
fields = fields[:5] + [','.join(fields[5:-3])] + fields[-3:]

Whatever you do, you should at a minimum be able to determine the number of offending commas in each line, and that should give you something (a sanity check, if nothing else).
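
A rough Ruby equivalent of the Python snippet, as a sketch only. The helper name rejoin_middle_field, the keyword arguments, and the sample line are hypothetical. The counts of clean leading and trailing fields have to come from the real file layout. It also returns the number of extra commas, for the sanity check mentioned above.

```ruby
# Hypothetical layout: `leading` clean fields, one free-text field,
# then `trailing` clean fields. Does not work if some fields are quoted.
def rejoin_middle_field(line, leading:, trailing:)
  parts = line.chomp.split(",", -1)   # -1 keeps trailing empty fields
  extra_commas = parts.length - (leading + trailing + 1)
  raise ArgumentError, "too few fields: #{line.inspect}" if extra_commas.negative?

  # Everything between the clean head and the clean tail is one field.
  middle = parts[leading...(parts.length - trailing)].join(",")
  fields = parts.first(leading) + [middle] + parts.last(trailing)
  [fields, extra_commas]
end

fields, extra = rejoin_middle_field("1,2,Dog Widgets, Inc.,x,y", leading: 2, trailing: 2)
p fields
p extra
```

Unlike the space-based approaches, this does not depend on the embedded commas being followed by a space, but it breaks down as soon as more than one field per line can contain a comma.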
