SQL*Loader:处理数据中的分隔符

发布于 2024-12-27 22:21:46 字数 604 浏览 1 评论 0原文

我正在通过 SQLLDR 将一些数据加载到 Oracle。源文件是“管道分隔”。

FIELDS TERMINATED BY '|'

但有些记录的数据中包含管道字符,而不是作为分隔符。因此,它会破坏记录的正确加载,因为它将 indata 管道字符理解为字段终止符。

你能给我指出解决这个问题的方向吗?

数据文件约9GB,手动编辑比较困难。

例如,

已加载行:

ABC|1234567|STR 9 R 25|98734959,32|28.12.2011

拒绝行:

DE4|2346543|WE| 454|956584,84|28.11.2011

错误:

Rejected - Error on table HSX, column DATE_N.
ORA-01847: day of month must be between 1 and last day of month

DATE_N 列是最后一列。

I am loading some data to Oracle via SQLLDR. The source file is "pipe delimited".

FIELDS TERMINATED BY '|'

But some records contain pipe character in data, and not as separator. So it breaks correct loading of records as it understands indata pipe characters as field terminator.

Can you point me a direction to solve this issue?

Data file is about 9 GB, so it is hard to edit manually.

For example,

Loaded row:

ABC|1234567|STR 9 R 25|98734959,32|28.12.2011

Rejected Row:

DE4|2346543|WE| 454|956584,84|28.11.2011

Error:

Rejected - Error on table HSX, column DATE_N.
ORA-01847: day of month must be between 1 and last day of month

DATE_N column is the last one.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

婴鹅 2025-01-03 22:21:46

您不能使用任何分隔符,并执行以下操作:

field FILLER,
col1 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

此正则表达式需要六个由竖线分隔的捕获组(括号内)(我必须转义它,因为否则它在正则表达式中意味着 OR)。除第三组之外的所有组都不能包含竖线 ([^|]*),第三组可以包含任何内容 (.*),并且正则表达式必须从头开始到行尾(^$)。

这样我们就可以确定第三组会吃掉所有多余的分隔符。这只有效,因为您只有一个可能包含分隔符的字段。如果您想进行校对,您可以指定第四组以数字开头(在第四个括号块的开头包含 \d)。

我将所有反斜杠加倍,因为我们位于双引号表达式内,但我不太确定我应该这样做。

, '\\1')", col2 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

此正则表达式需要六个由竖线分隔的捕获组(括号内)(我必须转义它,因为否则它在正则表达式中意味着 OR)。除第三组之外的所有组都不能包含竖线 ([^|]*),第三组可以包含任何内容 (.*),并且正则表达式必须从头开始到行尾(^$)。

这样我们就可以确定第三组会吃掉所有多余的分隔符。这只有效,因为您只有一个可能包含分隔符的字段。如果您想进行校对,您可以指定第四组以数字开头(在第四个括号块的开头包含 \d)。

我将所有反斜杠加倍,因为我们位于双引号表达式内,但我不太确定我应该这样做。

, '\\2')", col3 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

此正则表达式需要六个由竖线分隔的捕获组(括号内)(我必须转义它,因为否则它在正则表达式中意味着 OR)。除第三组之外的所有组都不能包含竖线 ([^|]*),第三组可以包含任何内容 (.*),并且正则表达式必须从头开始到行尾(^$)。

这样我们就可以确定第三组会吃掉所有多余的分隔符。这只有效,因为您只有一个可能包含分隔符的字段。如果您想进行校对,您可以指定第四组以数字开头(在第四个括号块的开头包含 \d)。

我将所有反斜杠加倍,因为我们位于双引号表达式内,但我不太确定我应该这样做。

, '\\3')", col4 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

此正则表达式需要六个由竖线分隔的捕获组(括号内)(我必须转义它,因为否则它在正则表达式中意味着 OR)。除第三组之外的所有组都不能包含竖线 ([^|]*),第三组可以包含任何内容 (.*),并且正则表达式必须从头开始到行尾(^$)。

这样我们就可以确定第三组会吃掉所有多余的分隔符。这只有效,因为您只有一个可能包含分隔符的字段。如果您想进行校对,您可以指定第四组以数字开头(在第四个括号块的开头包含 \d)。

我将所有反斜杠加倍,因为我们位于双引号表达式内,但我不太确定我应该这样做。

, '\\4')", col5 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

此正则表达式需要六个由竖线分隔的捕获组(括号内)(我必须转义它,因为否则它在正则表达式中意味着 OR)。除第三组之外的所有组都不能包含竖线 ([^|]*),第三组可以包含任何内容 (.*),并且正则表达式必须从头开始到行尾(^$)。

这样我们就可以确定第三组会吃掉所有多余的分隔符。这只有效,因为您只有一个可能包含分隔符的字段。如果您想进行校对,您可以指定第四组以数字开头(在第四个括号块的开头包含 \d)。

我将所有反斜杠加倍,因为我们位于双引号表达式内,但我不太确定我应该这样做。

, '\\5')", col6 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

此正则表达式需要六个由竖线分隔的捕获组(括号内)(我必须转义它,因为否则它在正则表达式中意味着 OR)。除第三组之外的所有组都不能包含竖线 ([^|]*),第三组可以包含任何内容 (.*),并且正则表达式必须从头开始到行尾(^$)。

这样我们就可以确定第三组会吃掉所有多余的分隔符。这只有效,因为您只有一个可能包含分隔符的字段。如果您想进行校对,您可以指定第四组以数字开头(在第四个括号块的开头包含 \d)。

我将所有反斜杠加倍,因为我们位于双引号表达式内,但我不太确定我应该这样做。

, '\\6')"

此正则表达式需要六个由竖线分隔的捕获组(括号内)(我必须转义它,因为否则它在正则表达式中意味着 OR)。除第三组之外的所有组都不能包含竖线 ([^|]*),第三组可以包含任何内容 (.*),并且正则表达式必须从头开始到行尾(^$)。

这样我们就可以确定第三组会吃掉所有多余的分隔符。这只有效,因为您只有一个可能包含分隔符的字段。如果您想进行校对,您可以指定第四组以数字开头(在第四个括号块的开头包含 \d)。

我将所有反斜杠加倍,因为我们位于双引号表达式内,但我不太确定我应该这样做。

You could not use any separator, and do something like:

field FILLER,
col1 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

This regexp takes six capture groups (inside parentheses) separated by a vertical bar (I had to escape it because otherwise it means OR in regexp). All groups except the third cannot contain a vertical bar ([^|]*), the third group may contain anything (.*), and the regexp must span from beginning to end of the line (^ and $).

This way we are sure that the third group will eat all superfluous separators. This only works because you've only one field that may contain separators. If you want to proofcheck you can for example specify that the fourth group starts with a digit (include \d at the beginning of the fourth parenthesized block).

I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.

, '\\1')", col2 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

This regexp takes six capture groups (inside parentheses) separated by a vertical bar (I had to escape it because otherwise it means OR in regexp). All groups except the third cannot contain a vertical bar ([^|]*), the third group may contain anything (.*), and the regexp must span from beginning to end of the line (^ and $).

This way we are sure that the third group will eat all superfluous separators. This only works because you've only one field that may contain separators. If you want to proofcheck you can for example specify that the fourth group starts with a digit (include \d at the beginning of the fourth parenthesized block).

I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.

, '\\2')", col3 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

This regexp takes six capture groups (inside parentheses) separated by a vertical bar (I had to escape it because otherwise it means OR in regexp). All groups except the third cannot contain a vertical bar ([^|]*), the third group may contain anything (.*), and the regexp must span from beginning to end of the line (^ and $).

This way we are sure that the third group will eat all superfluous separators. This only works because you've only one field that may contain separators. If you want to proofcheck you can for example specify that the fourth group starts with a digit (include \d at the beginning of the fourth parenthesized block).

I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.

, '\\3')", col4 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

This regexp takes six capture groups (inside parentheses) separated by a vertical bar (I had to escape it because otherwise it means OR in regexp). All groups except the third cannot contain a vertical bar ([^|]*), the third group may contain anything (.*), and the regexp must span from beginning to end of the line (^ and $).

This way we are sure that the third group will eat all superfluous separators. This only works because you've only one field that may contain separators. If you want to proofcheck you can for example specify that the fourth group starts with a digit (include \d at the beginning of the fourth parenthesized block).

I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.

, '\\4')", col5 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

This regexp takes six capture groups (inside parentheses) separated by a vertical bar (I had to escape it because otherwise it means OR in regexp). All groups except the third cannot contain a vertical bar ([^|]*), the third group may contain anything (.*), and the regexp must span from beginning to end of the line (^ and $).

This way we are sure that the third group will eat all superfluous separators. This only works because you've only one field that may contain separators. If you want to proofcheck you can for example specify that the fourth group starts with a digit (include \d at the beginning of the fourth parenthesized block).

I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.

, '\\5')", col6 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)

This regexp takes six capture groups (inside parentheses) separated by a vertical bar (I had to escape it because otherwise it means OR in regexp). All groups except the third cannot contain a vertical bar ([^|]*), the third group may contain anything (.*), and the regexp must span from beginning to end of the line (^ and $).

This way we are sure that the third group will eat all superfluous separators. This only works because you've only one field that may contain separators. If you want to proofcheck you can for example specify that the fourth group starts with a digit (include \d at the beginning of the fourth parenthesized block).

I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.

, '\\6')"

This regexp takes six capture groups (inside parentheses) separated by a vertical bar (I had to escape it because otherwise it means OR in regexp). All groups except the third cannot contain a vertical bar ([^|]*), the third group may contain anything (.*), and the regexp must span from beginning to end of the line (^ and $).

This way we are sure that the third group will eat all superfluous separators. This only works because you've only one field that may contain separators. If you want to proofcheck you can for example specify that the fourth group starts with a digit (include \d at the beginning of the fourth parenthesized block).

I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.

本宫微胖 2025-01-03 22:21:46

在我看来,SQL*Loader 不太可能处理您的文件,因为第三个字段:可以包含分隔符,不被引号包围并且长度可变。相反,如果您提供的数据是准确的示例,那么我可以提供示例解决方法。首先,创建一个包含一列 VARCHAR2 的表,其长度与文件中任何一行的最大长度相同。然后将整个文件加载到该表中。从那里您可以使用查询提取每一列,例如:

with CTE as
       (select 'ABC|1234567|STR 9 R 25|98734959,32|28.12.2011' as CTETXT
          from dual
        union all
        select 'DE4|2346543|WE| 454|956584,84|28.11.2011' from dual)
select substr(CTETXT, 1, instr(CTETXT, '|') - 1) as COL1
      ,substr(CTETXT
             ,instr(CTETXT, '|', 1, 1) + 1
             ,instr(CTETXT, '|', 1, 2) - instr(CTETXT, '|', 1, 1) - 1)
         as COL2
      ,substr(CTETXT
             ,instr(CTETXT, '|', 1, 2) + 1
             ,instr(CTETXT, '|', -1, 1) - instr(CTETXT, '|', 1, 2) - 1)
         as COL3
      ,substr(CTETXT, instr(CTETXT, '|', -1, 1) + 1) as COL4
  from CTE

它并不完美(尽管它可能适用于 SQL*Loader),但如果您有更多列或者您的第三个字段不是我认为的那样,则需要一些工作是。但是,这是一个开始。

It looks to me that it's not really possible for SQL*Loader to handle your file because of the third field which: can contain the delimiter, is not surrounded by quotes and is of a variable length. Instead, if the data you provide is an accurate example then I can provide a sample workaround. First, create a table with one column of VARCHAR2 with length the same as the maximum length of any one line in your file. Then just load the entire file into this table. From there you can extract each column with a query such as:

with CTE as
       (select 'ABC|1234567|STR 9 R 25|98734959,32|28.12.2011' as CTETXT
          from dual
        union all
        select 'DE4|2346543|WE| 454|956584,84|28.11.2011' from dual)
select substr(CTETXT, 1, instr(CTETXT, '|') - 1) as COL1
      ,substr(CTETXT
             ,instr(CTETXT, '|', 1, 1) + 1
             ,instr(CTETXT, '|', 1, 2) - instr(CTETXT, '|', 1, 1) - 1)
         as COL2
      ,substr(CTETXT
             ,instr(CTETXT, '|', 1, 2) + 1
             ,instr(CTETXT, '|', -1, 1) - instr(CTETXT, '|', 1, 2) - 1)
         as COL3
      ,substr(CTETXT, instr(CTETXT, '|', -1, 1) + 1) as COL4
  from CTE

It's not perfect (though it may be adaptable to SQL*Loader) but would need a bit of work if you have more columns or if your third field is not what I think it is. But, it's a start.

猥琐帝 2025-01-03 22:21:46

好的,我建议您解析文件并替换分隔符。
在 Unix/linux 的命令行中你应该这样做:

cat current_file | awk -F'|' '{printf( "%s,%s,", $1, $2); for(k=3;k<NF-2;k++) printf("%s|", $k); printf("%s,%s,%s", $(NF-2),$(NF-1),$NF);print "";}' > new_file

这个命令不会改变你当前的文件。
将创建一个新文件,以逗号分隔,包含五个字段。
它将输入文件分割为“|”并把第一块、第二块、任何东西都放到前一块、前一块和最后一块。

您可以尝试使用“,”分隔符对 new_file 进行 sqlldr 操作。

更新:
该命令可以放入类似的脚本(并命名为 parse.awk)中,

#!/usr/bin/awk
# parse.awk
BEGIN {FS="|"}
{
printf("%s,%s,", $1, $2);

for(k=3;k<NF-2;k++)
        printf("%s|", $k);

printf("%s,%s,%s\n", $(NF-2),$(NF-1),$NF);
}

并且可以按以下方式运行:

cat current_file | awk  -f parse.awk > new_file

OK, I recomend you to parse the file and replace the delimiter.
In command line in Unix/linux you should do:

cat current_file | awk -F'|' '{printf( "%s,%s,", $1, $2); for(k=3;k<NF-2;k++) printf("%s|", $k); printf("%s,%s,%s", $(NF-2),$(NF-1),$NF);print "";}' > new_file

This command will not change your current file.
Will create a new file, comma delimited, with five fields.
It splits the input file on "|" and take first, second, anything to antelast, antelast, and last chunk.

You can try to sqlldr the new_file with "," delimiter.

UPDATE:
The command can be put in a script like (and named parse.awk)

#!/usr/bin/awk
# parse.awk
BEGIN {FS="|"}
{
printf("%s,%s,", $1, $2);

for(k=3;k<NF-2;k++)
        printf("%s|", $k);

printf("%s,%s,%s\n", $(NF-2),$(NF-1),$NF);
}

and you can run in this way:

cat current_file | awk  -f parse.awk > new_file
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文