Teradata TPT load script performance
I hope someone can help me improve a Teradata TPT load script. I am using the script below to load a 3 GB delimited CSV file into Teradata. The file resides on my local laptop hard drive. The load takes approximately 30 minutes, which is quite long. The total number of rows to be loaded is approximately 30 million. Any recommendations on performance improvement?
DEFINE JOB LOAD_TD_FROM_CSV
DESCRIPTION 'Load Teradata table from CSV File'
(
DEFINE SCHEMA FOOD_TPT /*Define Table schema*/
DESCRIPTION 'FOOD_TPT'
(
col1 VARCHAR(20),
col2 VARCHAR(100),
col3 VARCHAR(100),
col4 VARCHAR(100),
col5 VARCHAR(100),
col6 VARCHAR(100),
col7 VARCHAR(100),
col8 VARCHAR(100)
);
DEFINE OPERATOR DDL_OPERATOR
TYPE DDL
ATTRIBUTES
(
VARCHAR TdpId = 'system', /*System Name*/
VARCHAR UserName = 'user', /*USERNAME*/
VARCHAR UserPassword = 'password', /*Password*/
VARCHAR Errorlist = '3807' /*This is added to skip the 'Table does not exist' error and treat it as a warning*/
);
DEFINE OPERATOR LOAD_CSV /*Load information*/
DESCRIPTION 'Operator to Load CSV Data'
TYPE LOAD
SCHEMA *
ATTRIBUTES
(
VARCHAR PrivateLogName,
VARCHAR TraceLevel = 'None',
INTEGER TenacityHours = 1,
INTEGER TenacitySleep = 1,
INTEGER MaxSessions = 4,
INTEGER MinSessions = 1,
INTEGER BUFFERSIZE =16,
VARCHAR TargetTable = 'FOOD_TPT_STG', /*Define target table name where the file will be loaded*/
VARCHAR LogTable = 'FOOD_TPT_LOG', /*Define Log table name*/
VARCHAR ErrorTable1 = 'FOOD_TPT_STG_E1',/*There are 2 error tables. Define them. First table is _ET table*/
VARCHAR ErrorTable2 = 'FOOD_TPT_STG_E2', /*Define _UV table*/
VARCHAR TdpId = 'system', /*System Name*/
VARCHAR UserName = 'user', /*Username*/
VARCHAR UserPassword = 'password' /*Password*/
);
DEFINE OPERATOR READ_CSV
DESCRIPTION 'Operator to Read CSV File'
TYPE DATACONNECTOR PRODUCER
SCHEMA FOOD_TPT
ATTRIBUTES
(
VARCHAR Filename = 'file.csv' /*give file name with path*/
,VARCHAR Format = 'Delimited'
,VARCHAR TextDelimiter = ','
,VARCHAR AcceptExcessColumns = 'N'
,VARCHAR PrivateLogName = 'LOAD_FROM_CSV'
,Integer SkipRows=1 /*skips the header in csv file*/
);
Step Setup_Tables /*Enter all executable SQLs in this step*/
(
APPLY
('Drop table FOOD_TPT_STG_E1;'), /*Drop error tables*/
('Drop table FOOD_TPT_STG_E2;'),
('Drop table FOOD_TPT_LOG;'), /*Drop Log Table*/
('Drop table FOOD_TPT_STG;'), /*Drop Target staging tables*/
('CREATE TABLE FOOD_TPT_STG ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
datablocksize= 1022 kbytes,
DEFAULT MERGEBLOCKRATIO
(
col1 VARCHAR(20),
col2 VARCHAR(100),
col3 VARCHAR(100),
col4 VARCHAR(100),
col5 VARCHAR(100),
col6 VARCHAR(100),
col7 VARCHAR(100),
col8 VARCHAR(100)
)
NO PRIMARY INDEX;') /*Create Target table*/
TO OPERATOR (DDL_OPERATOR);
);
Step Load_Table
(
APPLY ('INSERT INTO FOOD_TPT_STG
(
:col1
,:col2
,:col3
,:col4
,:col5
,:col6
,:col7
,:col8
);') /*Inserts records from CSV file into Target Table*/
TO OPERATOR (LOAD_CSV)
SELECT * FROM operator(READ_CSV);
);
);
Thanks in advance
Comments (1)
As Fred wrote, you specify a BUFFERSIZE of 16 KB, which works out to only 22 rows per block (FastLoad calculates this from the defined maximum row size: eight VARCHARs plus their 2-byte length indicators add up to at most about 736 bytes per row), and that results in roughly 1.6 million messages being sent. Remove the attribute and you get the default of 1 MB, i.e. about 1,400 rows per block. Additionally, you might simplify your script, for example along the lines sketched below.
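(The simplified script itself was not captured with the answer. Purely as an illustration of the template-based style it hints at, and not the answerer's actual script, a cut-down version might look roughly like the sketch below. It assumes the predefined TPT template operators $DDL, $LOAD and $FILE_READER, the $INSERT macro, inline ATTR(...) attribute overrides, and @-prefixed job variables; exact template and attribute names, and whether the schema may be omitted for an all-VARCHAR target, should be verified against the TPT documentation for your TTU release.)

DEFINE JOB LOAD_TD_FROM_CSV
DESCRIPTION 'Load Teradata table from CSV File'
(
  STEP Setup_Tables
  (
    APPLY
      ('DROP TABLE ' || @TargetTable || '_E1;'),
      ('DROP TABLE ' || @TargetTable || '_E2;'),
      ('DROP TABLE ' || @LogTable    || ';'),
      ('DROP TABLE ' || @TargetTable || ';'),
      ('CREATE TABLE ' || @TargetTable || ' ... NO PRIMARY INDEX;') /* same CREATE TABLE text as in the original script */
    TO OPERATOR ($DDL ATTR (TdpId = @TdpId, UserName = @UserName,
                            UserPassword = @UserPassword, ErrorList = '3807'));
  );
  STEP Load_Table
  (
    /* $INSERT generates the INSERT statement, and the schema is derived from the
       (all-VARCHAR) target table, so no DEFINE SCHEMA / DEFINE OPERATOR blocks are needed */
    APPLY $INSERT TO OPERATOR
      ($LOAD ATTR (TdpId = @TdpId, UserName = @UserName, UserPassword = @UserPassword,
                   TargetTable = @TargetTable, LogTable = @LogTable,
                   ErrorTable1 = @TargetTable || '_E1',
                   ErrorTable2 = @TargetTable || '_E2'))
    SELECT * FROM OPERATOR
      ($FILE_READER ATTR (FileName = @SourceFile, Format = 'Delimited',
                          TextDelimiter = ',', SkipRows = 1));
  );
);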
And there are job variable files to make scripts more reusable, for example:
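(Again just a sketch: the file name jobvars.txt and the variable names are illustrative, matching the @-references used in the sketch above.) The connection and file details could then live in a job variables file such as:

 TdpId        = 'system'
,UserName     = 'user'
,UserPassword = 'password'
,TargetTable  = 'FOOD_TPT_STG'
,LogTable     = 'FOOD_TPT_LOG'
,SourceFile   = 'file.csv'

and the job would be run with something like:

tbuild -f load_td_from_csv.tpt -v jobvars.txt -j LOAD_TD_FROM_CSV

That way the same script can be pointed at a different system, file or table just by swapping the variables file.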