Excel and awk disagree about CSV totals

Posted 2024-10-02 04:14:48

I have a CSV file that I'm totaling up two ways: one using Excel and the other using awk. Here are the totals of my first 8 columns in Excel:

1) 2640502474.00
2) 1272849386284.00
3) 36785.00
4) 
5) 107.00
6) 239259.00
7) 0.00
8) 7418570893330.00

And here's my awk output:

$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$1} END {printf("%01.2f\n", s)}'
2640502474.00
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$2} END {printf("%01.2f\n", s)}'
1272849386284.00
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$8} END {printf("%01.2f\n", s)}'
7411306364347.00

Notice how 1 and 2 match exactly but 8 is off by many millions. I'm assuming Excel's total is the correct one, so why is awk handling this file differently?

江南烟雨〆相思醉 2024-10-09 04:14:48

You likely have a comma-formatted number contained in quotes. Excel properly handles that number as a single field. Your field-separator regex in awk won't: according to that regex, a comma inside a number is a valid separator. It is very hard (and mostly futile) to try to handle the optional quoting and nested escaping that CSV allows with a regex.

Compare the following to see what is likely going on:

$ echo '"1","10","15","1,000","14"' | awk -F "\"*,\"*" '{print $4}'
1
$ echo '"1","10","15","1,000","14"' | awk -F "\",\"" '{print $4}'
1,000

Note that the second regex above still leaves a leading " on the first field and a trailing " on the last field, and it only works at all if all fields are consistently quoted. It is for illustration purposes only.
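
If you would rather stay in awk than switch to a real CSV parser, one possible workaround is to stop splitting with a regex and instead walk each line character by character, tracking whether you are inside quotes. The following is only a minimal sketch under some assumptions that are not in the question: `csv_sum` is a hypothetical helper name, no quoted field spans more than one line, and a comma inside a quoted number is a thousands separator that can be dropped before summing.

```shell
# Sum one column of a CSV file, treating commas inside double quotes as data.
# Assumptions (hypothetical, not from the original post): no quoted field
# spans several lines, and "1,000" means the number 1000.
csv_sum() {
  # $1 = 1-based column number, $2 = path to the CSV file
  awk -v col="$1" '
    {
      n = 1; field = ""; inq = 0
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "\"") {
          inq = !inq             # toggle quote state; the quote itself is dropped
        } else if (c == "," && !inq) {
          if (n == col) break    # the wanted field is complete
          n++; field = ""        # start collecting the next field
        } else {
          field = field c        # ordinary character, or a comma inside quotes
        }
      }
      if (n == col) {
        gsub(/,/, "", field)     # strip thousands separators: 1,000 -> 1000
        s += field
      }
    }
    END { printf("%.2f\n", s) }
  ' "$2"
}
```

With a file that contains only the sample row from above, `csv_sum 4` prints 1000.00 rather than 1.00. Running `csv_sum 8 /home/jason/import.csv` and comparing the result with Excel's column 8 total would be one way to check whether quoted, comma-formatted numbers really are the cause of the mismatch.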
