Best tool to combine CSV files with different structures?

Posted 2025-02-04


I have multiple large CSV files whose columns differ slightly. To feed them into AWS QuickSight for data visualization, I want to unify their structure. I see two ways of doing this:

  • Add the missing columns to each CSV file so all of them look the same
  • Combine all the CSV files into one large file

What is the best tool for doing this?

Is there any tool that can show the structural differences between two CSV files? If I could find out which columns are missing, I could also add them manually.
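A dedicated tool may not be needed for this part: comparing just the header rows of two files is a few lines of Python. A minimal sketch (the function name and file paths are my own, not from any library):

```python
import csv

def header_diff(path_a, path_b):
    """Report the column-name differences between two CSV files,
    looking only at the header row of each."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        cols_a = set(next(csv.reader(fa)))
        cols_b = set(next(csv.reader(fb)))
    return {
        "only_in_first": sorted(cols_a - cols_b),
        "only_in_second": sorted(cols_b - cols_a),
        "shared": sorted(cols_a & cols_b),
    }
```

Running it over each pair of files would show exactly which columns have to be added where.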

With pandas I can combine the CSV files, but the only way I know requires naming every column by hand (code below), which is not practical.

import pandas as pd

df1 = pd.DataFrame({'column1': [1, 2],
                    'column2': [3, 4]})

df2 = pd.DataFrame({'column1': [5, 6],
                    'column3': [7, 8]})

pd.concat([df1, df2], ignore_index=True)

Result:

   column1  column2  column3
0        1      3.0      NaN
1        2      4.0      NaN
2        5      NaN      7.0
3        6      NaN      8.0
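For what it's worth, the column names don't actually have to be typed by hand: when reading real files, `pd.read_csv` infers them from each file's header row, so the same `concat` call works directly on files. A minimal sketch (the function name and paths are hypothetical):

```python
import pandas as pd

def combine_csvs(paths):
    """Stack CSV files with possibly different columns; pd.read_csv
    infers each file's columns from its header row, and pd.concat
    fills columns missing from a file with NaN."""
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

This produces the same NaN-padded result as the toy example above, without writing any column name manually.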


2 Answers

半城柳色半声笛 2025-02-11 06:02:09


I cannot tell you what "the best" tool is; that is subjective and depends on many things.

I can tell you that Miller should probably be on your short list of tools to consider for working with CSV data. Also see the Miller GitHub site. One last thing: the author is super-helpful.

I have it on good authority that the following will do the job:

mlr --csv reshape -r "^A" -o item,value \
  then reshape -s item,value \
  then unsparsify --fill-with "" *.csv > result.csv

Some notes about the command:

  • reshape -r "^A" -o item,value transforms the input CSVs from wide to long, applied to every field whose name begins with "A";
  • reshape -s item,value transforms the previous output from long back to wide;
  • unsparsify --fill-with "" manages field names over all input records: for a field name absent from a given record but present in others, it fills in the value "".
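If Miller isn't at hand, the `unsparsify` step by itself (pad every record to the union of all field names, filling gaps with "") is easy to sketch in plain Python with the csv module; the function and file names below are my own:

```python
import csv

def unsparsify_csvs(in_paths, out_path, fill=""):
    """Rewrite several CSVs as one file whose header is the union of all
    input columns, filling fields missing from a row with `fill`."""
    rows, columns = [], []
    for path in in_paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                rows.append(row)
                for name in row:
                    if name not in columns:  # keep first-seen column order
                        columns.append(name)
    with open(out_path, "w", newline="") as f:
        # restval supplies the fill value for keys a row does not have
        writer = csv.DictWriter(f, fieldnames=columns, restval=fill)
        writer.writeheader()
        writer.writerows(rows)
```

This covers the first option in the question (pad every file to the same column set) without the wide/long round trip.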
七秒鱼° 2025-02-11 06:02:09


As much as I enjoyed using Miller to solve this problem, I also enjoyed using pandas to combine the CSV files. Here is the Python code to achieve it:

import pandas as pd
import glob
import os

path = os.getcwd()  # use your path
# collect all the CSV file names
all_files = glob.glob(os.path.join(path, "*.csv"))

li = []

# read csv files and create a DataFrame list
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

# join all the DataFrames
result_frame = pd.concat(li, ignore_index=True)

# export the result to a csv file
result_frame.to_csv('endresult.csv')
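One caveat about the last line: by default `to_csv` also writes the DataFrame's integer index as an extra unnamed first column, which QuickSight would treat as a spurious field; passing `index=False` avoids that. A small self-contained check (reusing the toy frames from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"column1": [1, 2], "column2": [3, 4]})
df2 = pd.DataFrame({"column1": [5, 6], "column3": [7, 8]})
combined = pd.concat([df1, df2], ignore_index=True)

# index=False keeps the row index out of the exported file,
# so only the real data columns reach QuickSight
combined.to_csv("endresult.csv", index=False)
```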