SQL*Loader:处理数据中的分隔符
我正在通过 SQLLDR 将一些数据加载到 Oracle。源文件是“管道分隔”。
FIELDS TERMINATED BY '|'
但有些记录的数据中包含管道字符,而不是作为分隔符。因此,它会破坏记录的正确加载,因为它将 indata 管道字符理解为字段终止符。
你能给我指出解决这个问题的方向吗?
数据文件约9GB,手动编辑比较困难。
例如,
已加载行:
ABC|1234567|STR 9 R 25|98734959,32|28.12.2011
拒绝行:
DE4|2346543|WE| 454|956584,84|28.11.2011
错误:
Rejected - Error on table HSX, column DATE_N.
ORA-01847: day of month must be between 1 and last day of month
DATE_N 列是最后一列。
I am loading some data to Oracle via SQLLDR. The source file is "pipe delimited".
FIELDS TERMINATED BY '|'
But some records contain pipe character in data, and not as separator. So it breaks correct loading of records as it understands indata pipe characters as field terminator.
Can you point me a direction to solve this issue?
Data file is about 9 GB, so it is hard to edit manually.
For example,
Loaded row:
ABC|1234567|STR 9 R 25|98734959,32|28.12.2011
Rejected Row:
DE4|2346543|WE| 454|956584,84|28.11.2011
Error:
Rejected - Error on table HSX, column DATE_N.
ORA-01847: day of month must be between 1 and last day of month
DATE_N column is the last one.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您不能使用任何分隔符,并执行以下操作:
此正则表达式需要六个由竖线分隔的捕获组(括号内)(我必须转义它,因为否则它在正则表达式中意味着 OR)。除第三组之外的所有组都不能包含竖线 (
[^|]*
),第三组可以包含任何内容 (.*
),并且正则表达式必须从头开始到行尾(^
和$
)。这样我们就可以确定第三组会吃掉所有多余的分隔符。这只有效,因为您只有一个可能包含分隔符的字段。如果您想进行校对,您可以指定第四组以数字开头(在第四个括号块的开头包含
\d
)。我将所有反斜杠加倍,因为我们位于双引号表达式内,但我不太确定我应该这样做。
You could not use any separator, and do something like:
This regexp takes six capture groups (inside parentheses) separated by a vertical bar (I had to escape it because otherwise it means OR in regexp). All groups except the third cannot contain a vertical bar (
[^|]*
), the third group may contain anything (.*
), and the regexp must span from beginning to end of the line (^
and$
).This way we are sure that the third group will eat all superfluous separators. This only works because you've only one field that may contain separators. If you want to proofcheck you can for example specify that the fourth group starts with a digit (include
\d
at the beginning of the fourth parenthesized block).I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.
在我看来,SQL*Loader 不太可能处理您的文件,因为第三个字段:可以包含分隔符,不被引号包围并且长度可变。相反,如果您提供的数据是准确的示例,那么我可以提供示例解决方法。首先,创建一个包含一列 VARCHAR2 的表,其长度与文件中任何一行的最大长度相同。然后将整个文件加载到该表中。从那里您可以使用查询提取每一列,例如:
它并不完美(尽管它可能适用于 SQL*Loader),但如果您有更多列或者您的第三个字段不是我认为的那样,则需要一些工作是。但是,这是一个开始。
It looks to me that it's not really possible for SQL*Loader to handle your file because of the third field which: can contain the delimiter, is not surrounded by quotes and is of a variable length. Instead, if the data you provide is an accurate example then I can provide a sample workaround. First, create a table with one column of VARCHAR2 with length the same as the maximum length of any one line in your file. Then just load the entire file into this table. From there you can extract each column with a query such as:
It's not perfect (though it may be adaptable to SQL*Loader) but would need a bit of work if you have more columns or if your third field is not what I think it is. But, it's a start.
好的,我建议您解析文件并替换分隔符。
在 Unix/linux 的命令行中你应该这样做:
这个命令不会改变你当前的文件。
将创建一个新文件,以逗号分隔,包含五个字段。
它将输入文件分割为“|”并把第一块、第二块、任何东西都放到前一块、前一块和最后一块。
您可以尝试使用“,”分隔符对 new_file 进行 sqlldr 操作。
更新:
该命令可以放入类似的脚本(并命名为 parse.awk)中,
并且可以按以下方式运行:
OK, I recomend you to parse the file and replace the delimiter.
In command line in Unix/linux you should do:
This command will not change your current file.
Will create a new file, comma delimited, with five fields.
It splits the input file on "|" and take first, second, anything to antelast, antelast, and last chunk.
You can try to sqlldr the new_file with "," delimiter.
UPDATE:
The command can be put in a script like (and named parse.awk)
and you can run in this way: