PostgreSQL: COPY FROM a CSV file with occasionally missing columns

Posted 2024-10-09 22:34:45

I have several billion rows of data in CSV files. Each row can have anything from 10 to 20 columns. I want to use COPY FROM to load the data into a table containing 20 columns. If a specific CSV row only contains 10 columns of data, then I expect COPY FROM to set the rest of the columns (for which the values are missing) to NULL. I specify DEFAULT NULL on every column in the CREATE TABLE statement.

MY QUESTION:
Can this be done using COPY FROM?

EDIT: Greenplum (a database based upon PostgreSQL) has a switch named FILL MISSING FIELDS, which does what I describe (see their documentation here). What workarounds would you recommend for PostgreSQL?

Comments (4)

長街聽風 2024-10-16 22:34:45

Write a pre-processing script that adds extra commas to the lines that don't have enough columns, or that transforms the CSV into TSV (tab-separated) and puts "\N" in the extra columns.
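
For illustration, a minimal sketch of such a pre-processing script in Python, taking the comma-padding route (the 20-column width and the stdin/stdout piping are assumptions, not something from the answer). It pads short rows with empty fields, which COPY in CSV mode reads as NULL by default:

import csv
import sys

EXPECTED_COLUMNS = 20  # assumed width of the target table

# Pad each row to the full width. In COPY's CSV mode an unquoted
# empty field is loaded as NULL by default, which is exactly what
# the question asks for.
reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)
for row in reader:
    row += [""] * (EXPECTED_COLUMNS - len(row))
    writer.writerow(row)

You could then pipe the output straight into the server, e.g. python3 pad_csv.py < data.csv | psql -c "\copy target_table FROM STDIN WITH (FORMAT csv)" (script, file, and table names are placeholders).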

风情万种。 2024-10-16 22:34:45

I don't think you can make COPY FROM deal with a varying number of columns within the same file.

If it's always the same 10 columns that are missing, a workaround could be to first load everything into a staging table that has a single text column.

After that, you can use SQL to split the line and extract the columns, something like this:

INSERT INTO target_table (col1, col2, col3, col4, col5, ...)
SELECT columns[1], columns[2], ...
FROM (
  SELECT string_to_array(big_column, ',') AS columns
  FROM staging_table
) t
WHERE array_length(columns, 1) = 10

and then do a similar thing with array_length(columns, 1) = 20 (note that array_length takes the array dimension as a second argument).
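
One step this answer glosses over is getting the raw lines into the single-column staging table in the first place, since a normal CSV-mode COPY would again reject rows with the wrong column count. Here is a sketch of that load in Python with psycopg2; the connection string, file name, and the choice of \x01 as a delimiter byte that never occurs in the data are all assumptions:

import psycopg2

# Assumes: CREATE TABLE staging_table (big_column text);
conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn, conn.cursor() as cur, open("data.csv") as f:
    # Text format with an out-of-band delimiter keeps each input line
    # intact in the single column. Caveat: text format still treats
    # backslashes as escape characters, so this assumes the data
    # contains none (or that you escape them first).
    cur.copy_expert(
        "COPY staging_table (big_column) FROM STDIN "
        "WITH (FORMAT text, DELIMITER E'\\x01')",
        f,
    )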

初雪 2024-10-16 22:34:45

In the context of ETL and data warehousing -- my suggestion would be to actually avoid the "shortcut" you are looking for.

ETL is a process, frequently implemented as ECCD (Extract, Clean, Conform, Deliver). You could treat those files as "Extracted", so simply implement data cleaning and conforming as different steps -- you will need some extra disk space for that. All conformed files should have the "final" (all columns) structure. Then deliver (COPY FROM) those conformed files.

This way you will also be able to document the ETL process and what happens to the missing fields in each step.

It is common practice to archive (disk, DVD) the original customer files and the conformed versions for audit and debugging purposes.

悲凉≈ 2024-10-16 22:34:45

From the PostgreSQL manual:

COPY FROM will raise an error if any line of the input file contains more or fewer columns than are expected.

Read the first line of your CSV file to see how many columns you have to name in the COPY statement.
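
If the file really is uniform, that inspection is easy to script. A small sketch in Python; the file name, table name, and generated colN column names are placeholders:

import csv

# Count the fields in the first line and name exactly that many
# columns in the COPY statement.
with open("data.csv", newline="") as f:
    first_row = next(csv.reader(f))

columns = [f"col{i + 1}" for i in range(len(first_row))]
copy_sql = (
    f"COPY target_table ({', '.join(columns)}) "
    "FROM '/path/to/data.csv' WITH (FORMAT csv)"
)
print(copy_sql)

Note this only helps when every row has the same number of columns, which is not the case in the question; it mainly confirms the error the manual describes.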
