Best tool to combine CSV files with different structures?

Posted 2025-02-04


I have multiple large CSV files whose columns differ slightly. To feed them into AWS QuickSight for data visualization, I want to unify their structure. I see two ways of doing this:

  • Add the missing columns to each CSV file so all of them look the same
  • Combine all the CSV files into one large file

What is the best tool for doing this?

Is there any tool that can show the structural differences between two CSV files? If I could find out which columns are missing, I could also add them manually.
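A dedicated tool may not be needed for this part: comparing just the header rows of two files is a few lines of Python. A minimal sketch (the function name and file paths are my own, not from any library):

```python
import csv

def header_diff(path_a, path_b):
    """Report the column-name differences between two CSV files,
    looking only at the header row of each."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        cols_a = set(next(csv.reader(fa)))
        cols_b = set(next(csv.reader(fb)))
    return {
        "only_in_first": sorted(cols_a - cols_b),
        "only_in_second": sorted(cols_b - cols_a),
        "shared": sorted(cols_a & cols_b),
    }
```

Running it over each pair of files would show exactly which columns have to be added where.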

With pandas I can combine the CSV files, but the only way I know requires naming every column by hand (code below), which is not practical.

import pandas as pd

df1 = pd.DataFrame({'column1': [1, 2],
                    'column2': [3, 4]})

df2 = pd.DataFrame({'column1': [5, 6],
                    'column3': [7, 8]})

pd.concat([df1, df2], ignore_index=True)

Result:

   column1  column2  column3
0        1      3.0      NaN
1        2      4.0      NaN
2        5      NaN      7.0
3        6      NaN      8.0
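For what it's worth, the column names don't actually have to be typed by hand: when reading real files, `pd.read_csv` infers them from each file's header row, so the same `concat` call works directly on files. A minimal sketch (the function name and paths are hypothetical):

```python
import pandas as pd

def combine_csvs(paths):
    """Stack CSV files with possibly different columns; pd.read_csv
    infers each file's columns from its header row, and pd.concat
    fills columns missing from a file with NaN."""
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

This produces the same NaN-padded result as the toy example above, without writing any column name manually.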


2 Answers

半城柳色半声笛 2025-02-11 06:02:09


I cannot tell you what "the best" tool is; that is subjective and depends on many things.

I can tell you that Miller should probably be on your short list of tools to consider for working with CSV data. Also see the Miller GitHub site. One last thing: the author is super-helpful.

I have it on good authority that the following will do the job:

mlr --csv reshape -r "^A" -o item,value \
  then reshape -s item,value \
  then unsparsify --fill-with "" *.csv > result.csv

Some notes about the command:

  • reshape -r "^A" -o item,value transforms the input CSVs from wide to long, applied to every field whose name begins with "A";
  • reshape -s item,value transforms the previous output from long back to wide;
  • unsparsify --fill-with "" manages field names over all input records: for a field name absent from a given record but present in others, it fills in the value "".
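If Miller isn't at hand, the `unsparsify` step by itself (pad every record to the union of all field names, filling gaps with "") is easy to sketch in plain Python with the csv module; the function and file names below are my own:

```python
import csv

def unsparsify_csvs(in_paths, out_path, fill=""):
    """Rewrite several CSVs as one file whose header is the union of all
    input columns, filling fields missing from a row with `fill`."""
    rows, columns = [], []
    for path in in_paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                rows.append(row)
                for name in row:
                    if name not in columns:  # keep first-seen column order
                        columns.append(name)
    with open(out_path, "w", newline="") as f:
        # restval supplies the fill value for keys a row does not have
        writer = csv.DictWriter(f, fieldnames=columns, restval=fill)
        writer.writeheader()
        writer.writerows(rows)
```

This covers the first option in the question (pad every file to the same column set) without the wide/long round trip.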
七秒鱼° 2025-02-11 06:02:09


As much as I enjoyed using Miller to solve this problem, I also enjoyed using pandas to combine the CSV files. Here is the Python code to achieve it:

import pandas as pd
import glob
import os

path = os.getcwd()  # use your path
# collect all the CSV file names
all_files = glob.glob(os.path.join(path, "*.csv"))

li = []

# read csv files and create a DataFrame list
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

# join all the DataFrames
result_frame = pd.concat(li, ignore_index=True)

# export the result to a csv file
result_frame.to_csv('endresult.csv')
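One caveat about the last line: by default `to_csv` also writes the DataFrame's integer index as an extra unnamed first column, which QuickSight would treat as a spurious field; passing `index=False` avoids that. A small self-contained check (reusing the toy frames from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"column1": [1, 2], "column2": [3, 4]})
df2 = pd.DataFrame({"column1": [5, 6], "column3": [7, 8]})
combined = pd.concat([df1, df2], ignore_index=True)

# index=False keeps the row index out of the exported file,
# so only the real data columns reach QuickSight
combined.to_csv("endresult.csv", index=False)
```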