I have several billion rows of data in CSV files. Each row can have anything from 10 to 20 columns. I want to use COPY FROM to load the data into a table containing 20 columns. If a specific CSV row only contains 10 columns of data, then I expect COPY FROM to set the rest of the columns (for which the values are missing) to NULL. I specify DEFAULT NULL on every column in the CREATE TABLE statement.
MY QUESTION:
Can this be done using COPY FROM?
EDIT: Greenplum (a database based upon PostgreSQL) has a switch named FILL MISSING FIELDS, which does what I describe (see their documentation here). What workarounds would you recommend for PostgreSQL?
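For reference, a minimal sketch of what I mean (table and column names are made up); a plain COPY in CSV mode rejects the short rows with a "missing data for column" error, which is why I need a workaround:

    -- Hypothetical 20-column target table; a column with no DEFAULT
    -- clause defaults to NULL anyway, so DEFAULT NULL is redundant
    -- but harmless.
    CREATE TABLE target (
        col1  text, col2  text, col3  text, col4  text, col5  text,
        col6  text, col7  text, col8  text, col9  text, col10 text,
        col11 text, col12 text, col13 text, col14 text, col15 text,
        col16 text, col17 text, col18 text, col19 text, col20 text
    );

    -- Fails as soon as it hits a row with fewer than 20 fields:
    --   ERROR:  missing data for column "col11"
    COPY target FROM '/path/to/data.csv' WITH (FORMAT csv);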
Comments (4)
Write a pre-processing script that either adds the missing trailing commas to the lines that don't have enough columns (COPY's CSV mode reads unquoted empty fields as NULL by default), or transforms the CSV into TSV (tab-separated) and puts "\N" in the missing fields.
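A sketch of the load side of that approach, assuming the script has already produced a tab-separated file with \N in the missing fields (path and table name are hypothetical):

    -- COPY's default format is text: tab-delimited, with \N as the
    -- default NULL marker, so the padded fields arrive as NULL.
    COPY target FROM '/path/to/data.tsv';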
I don't think you can make COPY FROM deal with a different number of columns inside the same file.

If it's always the same 10 columns that are missing, a workaround could be to first load everything into a staging table that has a single text column. After that, you can use SQL to split each line and extract the columns, something like this:
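A sketch of that idea, assuming a staging table staging(line text), unquoted comma-separated data, and the hypothetical target table from above:

    -- Split each raw line into a text array and insert the rows that
    -- have exactly 10 fields into the first 10 columns; the remaining
    -- columns fall back to their default, i.e. NULL.
    -- Note: string_to_array does not honor CSV quoting, so this only
    -- works if fields contain no embedded commas.
    INSERT INTO target (col1, col2, col3, col4, col5,
                        col6, col7, col8, col9, col10)
    SELECT columns[1], columns[2], columns[3], columns[4], columns[5],
           columns[6], columns[7], columns[8], columns[9], columns[10]
    FROM (
        SELECT string_to_array(line, ',') AS columns
        FROM staging
    ) AS t
    WHERE array_length(columns, 1) = 10;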
and then do a similar thing with array_length(columns, 1) = 20.
In the context of etl and data-warehouse, my suggestion would be to actually avoid the "shortcut" you are looking for. ETL is a process, frequently implemented as ECCD (Extract, Clean, Conform, Deliver). You could treat those files as "extracted", so simply implement data cleaning and conforming as separate steps -- you will need some extra disk space for that. All conformed files should have the "final" (all columns) structure. Then deliver (COPY FROM) those conformed files. This way you will also be able to document the ETL process and what happens to the missing fields in each step.

It is a usual practice to archive (disk, DVD) the original customer files and the conformed versions for audit and debugging purposes.
From the PostgreSQL manual:

    Table columns not specified in the COPY FROM column list will receive their default values.

Read the first line of your CSV file to see how many columns you have to name in the COPY statement.
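For example, if a given file contains only the first 10 columns, a column list along these lines (hypothetical names again) loads it and leaves the rest NULL:

    -- Columns not named in the list (col11 .. col20) are not read
    -- from the file and receive their default values, i.e. NULL.
    COPY target (col1, col2, col3, col4, col5,
                 col6, col7, col8, col9, col10)
    FROM '/path/to/data_10cols.csv' WITH (FORMAT csv);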