将CSV文件与不同结构相结合的最佳工具?
我有多个大型CSV文件。这些CSV文件几乎没有列的差异。为了将它们馈送到AWS快速视图以进行数据可视化,我想统一这些CSV文件的结构。我认为这样做有两种方法:
- 将缺少的列添加到每个CSV文件中,因此所有这些都看起来相同的
- 将所有CSV文件组合到一个大文件中,
什么是最佳的工具?
是否有任何工具可以显示两个CSV文件的结构差异?如果我发现缺少哪些列,我也可以手动添加它们。
使用PANDAS,我可以将CSV文件组合起来,但是以我所知的方式,我应该命名所有列(下面代码),这没有用。
import pandas as pd
df1 = pd.DataFrame({'column1': [1,2],
'column2': [3,4],
})
df2 = pd.DataFrame({'column1': [5,6],
'column3': [7,8],
})
pd.concat([df1,df2],ignore_index=True)
结果:
column1 column2 column3
0 1 3.0 NaN
1 2 4.0 NaN
2 5 NaN 7.0
3 6 NaN 8.0
I have multiple large CSV files. These CSV files have few column differences. For feeding them to AWS QuickSight for data visualization, I want to unify the structure of these CSV files. I think for doing this I have two ways:
- Add the missing columns to each CSV file so all of them look the same
- Combine all the CSV files into one large file
What is the best tool for doing this?
Is there any tool that can show structural difference of two CSV file? If I find out which columns are missing I can also add them manually.
With pandas I can combine the CSV files, but in the way I know, I should name all the columns (code below) and this is not useful.
import pandas as pd
df1 = pd.DataFrame({'column1': [1,2],
'column2': [3,4],
})
df2 = pd.DataFrame({'column1': [5,6],
'column3': [7,8],
})
pd.concat([df1,df2],ignore_index=True)
Result:
column1 column2 column3
0 1 3.0 NaN
1 2 4.0 NaN
2 5 NaN 7.0
3 6 NaN 8.0
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我不能告诉你什么“最好的” 工具是什么;这是许多依赖性的主观。
我可以告诉你, Miller >用于考虑使用CSV数据的工具。另请参见
miller
github网站。最后一件事:作者是超级 - 螺旋。
我拥有以下工作的良好权限,以下将完成这项工作:
mlr - -CSV reshape -r“^a” -o item,value talue inthe reshape -s item,value \ thut nortparsify -fill -fill -with''“” *.csv> result.csv
有关命令的一些注释:
I cannot tell you what "the best" tool is; that's subjective with many dependencies.
I can tell you that miller should probably be on your short list for tools to consider for working with CSV data. Also see the
miller
GitHub site. One last thing: the author is super-helpful.I have it on good authority that the following will do the job:
mlr --csv reshape -r "^A" -o item,value then reshape -s item,value \ then unsparsify --fill-with "" *.csv > result.csv
Some notes about the command:
尽管我喜欢使用Miller解决这个问题,但我也很喜欢使用Panda结合CSV文件。这是实现它的Python代码:
As much as I enjoyed using miller to solve this problem, I also did enjoy using panda to combine CSV files. Here is the python code to achieve it: