Using awk to process a file where each record has different fixed-width fields

Posted 2024-08-03 17:33:11

I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several different record types, and each record type has a different set of fixed-width fields (there is no field separator character). The first two characters of the record indicate the type; from this you know which fields should follow. A file might look something like this:

AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99

Using Gawk I can set the FIELDWIDTHS, but that applies to the whole file (unless I am missing some way of setting this on a record-by-record basis), or I can set FS to "" and process the file one character at a time, but that's a bit cumbersome.

Is there a good way to extract the fields from such a file using Awk?

Edit: Yes, I could use Perl (or something else). I'm still keen to know whether there is a sensible way of doing it with Awk though.

烦人精 2024-08-10 17:33:11

Hopefully this will lead you in the right direction. Assuming your multi-line records are guaranteed to be terminated by a 'CC' type row, you can pre-process your text file using simple if-then logic. I have presumed you require fields 1, 5 and 7 on one row; a sample awk script would be:

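BEGIN {
    # Fields collected across the AA and BB rows of one logical record.
    field1 = ""
    field5 = ""
    field7 = ""
}
{
    # The first two characters identify the record type.
    record_type = substr($0, 1, 2)
    if (record_type == "AA")
    {
        field1 = substr($0, 3, 6)
    }
    else if (record_type == "BB")
    {
        field5 = substr($0, 9, 6)
        field7 = substr($0, 21, 18)
    }
    else if (record_type == "CC")
    {
        # A CC row ends the logical record, so emit the collected fields.
        print field1 "|" field5 "|" field7
    }
}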

Create an awk script file called program.awk and put that code into it. Execute the script using:

awk -f program.awk < my_multi_line_file.txt

北城孤痞 2024-08-10 17:33:11

You may be able to use two passes:

1step.awk

/^AA/{printf "2 6 6 12"    }
/^BB/{printf "2 6 6 6 18 6"}
/^CC/{printf "2 8"         }
{printf "\n%s\n", $0}

2step.awk

NR%2 == 1 {FIELDWIDTHS=$0}
NR%2 == 0 {print $2}

And then:

awk -f 1step.awk sample  | awk -f 2step.awk

ゃ懵逼小萝莉 2024-08-10 17:33:11

You probably need to suppress (or at least ignore) awk's built-in field separation code, and use a program along the lines of:

awk '/^AA/ { manually process record AA out of $0 }
     /^BB/ { manually process record BB out of $0 }
     /^CC/ { manually process record CC out of $0 }' file ...

The manual processing will be a bit fiddly - I suppose you'll need to use the substr function to extract each field by position, so what I've got as one line per record type will be more like one line per field in each record type, plus the follow-on printing.

I do think you might be better off with Perl and its unpack feature, but awk can handle it too, albeit verbosely.
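
The manual approach described above might look roughly like the sketch below. It is only an illustration: the offsets and widths are read off the sample records in the question, and the file name sample.txt, the helper name extract_fields and the pipe-delimited output are all assumptions rather than anything from the real data.

```shell
# sample.txt is a hypothetical file holding the three sample records from the question.
printf '%s\n' \
  'AAField1Field2LongerField3' \
  'BBField4Field5Field6VeryVeryLongField7Field8' \
  'CCField99' > sample.txt

# extract_fields is a hypothetical helper: one awk rule per record type,
# with substr($0, start, length) pulling each field out by position.
# The offsets are assumptions inferred from the sample records.
extract_fields() {
  awk '
    /^AA/ {
      printf "%s|%s|%s\n", substr($0, 3, 6), substr($0, 9, 6), substr($0, 15, 12)
    }
    /^BB/ {
      printf "%s|%s|%s|%s|%s\n", substr($0, 3, 6), substr($0, 9, 6), substr($0, 15, 6),
             substr($0, 21, 18), substr($0, 39, 6)
    }
    /^CC/ {
      printf "%s\n", substr($0, 3, 7)
    }
  ' "$@"
}

extract_fields sample.txt
```

This uses only substr and printf, so it should not depend on gawk extensions. The cost is the verbosity mentioned above: every field needs its own start position and width.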

稚然 2024-08-10 17:33:11

Could you use Perl, and then select an unpack template based on the first two characters of the line?

花伊自在美 2024-08-10 17:33:11

One awk idea using an array to keep track of the different FIELDWIDTHS formats:

awk '
BEGIN { fw["AA"] = "2 6 6 12"                     # predefined FIELDWIDTHS
        fw["BB"] = "2 6 6  6 18 6"
        fw["CC"] = "2 7"
      }
      { FIELDWIDTHS = fw[substr($0,1,2)]          # dynamically define FIELDWIDTHS based on 1st two characters
        $0 = $0                                   # force reparse of input line based on new FIELDWIDTHS
        print "#############",$0
        for (i=1;i<=NF;i++)
            print "field #"i,":",$i
      }
' input.txt

This generates:

############# AAField1Field2LongerField3
field #1 : AA
field #2 : Field1
field #3 : Field2
field #4 : LongerField3
############# BBField4Field5Field6VeryVeryLongField7Field8
field #1 : BB
field #2 : Field4
field #3 : Field5
field #4 : Field6
field #5 : VeryVeryLongField7
field #6 : Field8
############# CCField99
field #1 : CC
field #2 : Field99
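
FIELDWIDTHS is a gawk feature, so for other awks the same per-type width table could drive substr directly. A minimal sketch, assuming the widths listed above match the real layout; parse_fixed is a hypothetical helper name, and silently skipping unknown record types is my assumption, not something from the question.

```shell
# parse_fixed is a hypothetical helper: it walks a per-type width list and
# cuts each field out with substr, so it does not rely on FIELDWIDTHS.
parse_fixed() {
  awk '
    BEGIN {
      # Widths per record type, copied from the sample layout (assumed).
      fw["AA"] = "2 6 6 12"
      fw["BB"] = "2 6 6 6 18 6"
      fw["CC"] = "2 7"
    }
    {
      tag = substr($0, 1, 2)            # first two characters select the layout
      if (!(tag in fw)) next            # assumption: skip unknown record types
      n = split(fw[tag], w, " ")        # w[1..n] holds the field widths
      pos = 1
      for (i = 1; i <= n; i++) {
        printf "field #%d : %s\n", i, substr($0, pos, w[i])
        pos += w[i]                     # advance to the start of the next field
      }
    }
  ' "$@"
}

# Example: feed one sample record on standard input.
printf '%s\n' 'CCField99' | parse_fixed
```

The function reads files named as arguments or standard input, so `parse_fixed input.txt` works the same way as the pipeline shown.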
