Using awk to process a file where each record has different fixed-width fields
I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several different record types, and each record type has a different set of fixed-width fields (there is no field separator character). The first two characters of the record indicate the type; from this you know which fields should follow. A file might look something like this:
AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99
Using Gawk I can set the FIELDWIDTHS, but that applies to the whole file (unless I am missing some way of setting this on a record-by-record basis), or I can set FS to "" and process the file one character at a time, but that's a bit cumbersome.
Is there a good way to extract the fields from such a file using Awk?
Edit: Yes, I could use Perl (or something else). I'm still keen to know whether there is a sensible way of doing it with Awk though.
Comments (7)
Hopefully this will point you in the right direction. Assuming your multi-line records are guaranteed to be terminated by a 'CC' type row, you can pre-process your text file using simple if-then logic. I have presumed you need fields 1, 5 and 7 on one row, which a short awk script can produce.
Create an awk script file called program.awk, put the code into it, and run it with awk's -f option.
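The script from this answer is not preserved here, so the following is only a minimal sketch of the idea, not the answerer's code. It assumes the field positions that can be read off the sample data in the question (type in columns 1-2, Field1 in 3-8, Field5 in 9-14 of a BB row, Field7 in 21-38 of a BB row) and that every logical record ends with a CC row. The wrapper name join_fields is hypothetical.

```shell
# Hypothetical wrapper: collect fields 1, 5 and 7 and print them
# on one row each time a CC row closes the logical record.
join_fields() {
  awk '
    # Positions are assumptions read off the sample data in the question.
    /^AA/ { f1 = substr($0, 3, 6) }                           # Field1
    /^BB/ { f5 = substr($0, 9, 6); f7 = substr($0, 21, 18) }  # Field5, Field7
    /^CC/ { print f1, f5, f7; f1 = f5 = f7 = "" }             # flush and reset
  ' "$@"
}
# Usage: join_fields legacy.dat
```

Run against the three sample lines, this prints a single row: Field1 Field5 VeryVeryLongField7.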
You could perhaps use two passes:
1step.awk
2step.awk
And then
You probably need to suppress (or at least ignore) awk's built-in field separation code, and use a program with one rule per record type.
The manual processing will be a bit fiddly. I suppose you'll need to use the substr function to extract each field by position, so what I've described as one line per record type will be more like one line per field in each record type, plus the follow-on printing.
I do think you might be better off with Perl and its unpack feature, but awk can handle it too, albeit verbosely.
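As a rough illustration of the substr approach, here is a sketch that works in any POSIX awk. It ignores the built-in field splitting and dispatches on the first two characters. The offsets and widths are assumptions inferred from the sample lines in the question, and the wrapper name extract_fields is hypothetical.

```shell
# Hypothetical wrapper: one awk rule per record type, fields cut by position.
extract_fields() {
  awk '
    BEGIN { OFS = "," }
    { type = substr($0, 1, 2) }   # record type lives in columns 1-2
    # Offsets below are assumptions taken from the sample data.
    type == "AA" { print type, substr($0, 3, 6), substr($0, 9, 6), substr($0, 15, 12) }
    type == "BB" { print type, substr($0, 3, 6), substr($0, 9, 6), substr($0, 15, 6), substr($0, 21, 18), substr($0, 39, 6) }
    type == "CC" { print type, substr($0, 3, 7) }
  ' "$@"
}
# Usage: extract_fields legacy.dat
```

For the sample file this yields comma-separated rows such as AA,Field1,Field2,LongerField3. The verbosity the answer mentions shows up quickly: every field needs its own substr call with a hand-computed offset.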
Could you use Perl and then select an unpack template based on the first two chars of the line?
One awk idea is to use an array to keep track of the different FIELDWIDTHS formats.
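The original code for this answer is missing here, so what follows is a sketch of how the array idea might look, not the answerer's script. It requires gawk, since FIELDWIDTHS is a gawk extension. Assigning FIELDWIDTHS and then re-assigning $0 to itself makes gawk re-split the current record with the new widths, which is what allows a per-record layout. The width lists are assumptions inferred from the sample data, and split_by_type is a hypothetical name.

```shell
# Hypothetical wrapper: pick FIELDWIDTHS per record from a lookup array (gawk only).
split_by_type() {
  gawk '
    BEGIN {
      OFS = "|"
      # Width lists are assumptions taken from the sample data.
      fw["AA"] = "2 6 6 12"
      fw["BB"] = "2 6 6 6 18 6"
      fw["CC"] = "2 7"
    }
    {
      type = substr($0, 1, 2)
      if (!(type in fw)) next   # unknown record types are skipped silently
      FIELDWIDTHS = fw[type]    # choose the layout for this record
      $0 = $0                   # force a re-split with the new widths
      $1 = $1                   # rebuild the record so OFS is applied
      print
    }
  ' "$@"
}
# Usage: split_by_type legacy.dat
```

After the re-split, $2, $3 and so on can be used like ordinary fields; the print here just shows the boundaries, for example AA|Field1|Field2|LongerField3 for the first sample line.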
Better to use a fully featured scripting language like Perl or Ruby.
What about two scripts? E.g. the first script inserts field separators based on the first characters, and the second then processes the result.
Or, first of all, define a function in your AWK script that splits the lines into variables based on the input. I would go this way, for possible reuse.
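Both suggestions can be combined in one sketch: a reusable function cuts a line according to a list of widths, a first pass uses it to insert separators, and a second pass works on ordinary delimited fields. Everything named here (split_fixed, add_separators, report, the "|" separator, the width lists) is an assumption for illustration; the widths are inferred from the sample data, and the sketch assumes "|" never occurs in the data itself.

```shell
# Pass 1 (hypothetical): insert "|" between fixed-width fields, by record type.
add_separators() {
  awk '
    # Cut line into out[1..n] using a space-separated list of widths.
    # Names after the extra spaces are local variables.
    function split_fixed(line, widths, out,    n, i, pos, w) {
      n = split(widths, w, " ")
      pos = 1
      for (i = 1; i <= n; i++) {
        out[i] = substr(line, pos, w[i])
        pos += w[i]
      }
      return n
    }
    BEGIN {
      # Width lists are assumptions taken from the sample data.
      layout["AA"] = "2 6 6 12"
      layout["BB"] = "2 6 6 6 18 6"
      layout["CC"] = "2 7"
    }
    {
      type = substr($0, 1, 2)
      if (!(type in layout)) next   # unknown record types are skipped
      n = split_fixed($0, layout[type], f)
      line = f[1]
      for (i = 2; i <= n; i++) line = line "|" f[i]
      print line
    }
  ' "$@"
}

# Pass 2 (hypothetical): normal delimited processing; print type and last field.
report() {
  awk -F "|" '{ print $1, $NF }' "$@"
}
# Usage: add_separators legacy.dat | report
```

Keeping the layouts in one table means a new record type only needs one more layout entry, and the second pass never has to know about column positions.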