从行中提取可选字段值

发布于 2024-12-26 16:05:49 字数 712 浏览 0 评论 0原文

我的文本采用单独行的形式,其中每行都具有类似 CSV 的格式:

SOME BUNCH OF TEXT, FIELD_A: 12, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656

字段的顺序始终相同,但某些字段可能不存在。感兴趣的字段之间可以有其他字段,例如与上面的行相比,我也可以得到以下内容:

SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS

作为处理此文本的结果,我希望拥有一个又一个指定字段的干净的 CSV 文件:

12,0.2321,12:10:08 2011/07/22,656

如果某个字段不存在,那么我想简单地省略值(例如,FIELD_B 不存在):

12,,12:10:08 2011/07/22,656

如何使用 sed、perl 或 awk 等命令来执行此操作? 我尝试使用 perl -pe 's/^.*?(FIELD_A: (.*?),)?.*?$/\2/' 提取单个字段,但失败 - 正则表达式只是忽略我的场,即使它呈现

I have text in the form of separate lines, where each line has CSV-like format:

SOME BUNCH OF TEXT, FIELD_A: 12, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656

The order of fields is always the same, but some fields may be absent. There can be other fields between fields of interest, for example comparing to the line above I can get the following as well:

SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS

As the result of processing this text I want to have clean CSV file with my fields specified one after another:

12,0.2321,12:10:08 2011/07/22,656

If some field is absent then I would like to simple omit value (for example FIELD_B was absent):

12,,12:10:08 2011/07/22,656

How can I do this using commands like sed, perl or awk ?
I tried extracting single field with perl -pe 's/^.*?(FIELD_A: (.*?),)?.*?$/\2/' and failed - regex simply ignores my field even if it presents

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

写给空气的情书 2025-01-02 16:05:49

您可以将 awk 与关联数组一起使用,如下所示。循环字段并在 : 上分割它们。然后将键值对存储到关联数组中。最后打印出你想要的字段。

awk -F, '{
 split("",arr)
 for(i=1; i<=NF; i++){
   a=index($i, ":")
   if(a != 0){
     # split on first colon to get key-value pair
     key=substr($i,1,a-1)
     val=substr($i,a+1)

     # remove leading spaces from key and value
     gsub(/^ */,"",key)
     gsub(/^ */,"",val)

     # store in an associative array
     arr[key]=val
   }   
 }
 # print out the desired fields
 print arr["FIELD_A"]","arr["FIELD_B"]","arr["FIELD_C"]","arr["FIELD_D"]
}' data.txt

You can use awk with an associative array as shown below. Loop over the fields and split them on :. Then store the key-value pair into an associative array. Finally print out the fields you want.

awk -F, '{
 split("",arr)
 for(i=1; i<=NF; i++){
   a=index($i, ":")
   if(a != 0){
     # split on first colon to get key-value pair
     key=substr($i,1,a-1)
     val=substr($i,a+1)

     # remove leading spaces from key and value
     gsub(/^ */,"",key)
     gsub(/^ */,"",val)

     # store in an associative array
     arr[key]=val
   }   
 }
 # print out the desired fields
 print arr["FIELD_A"]","arr["FIELD_B"]","arr["FIELD_C"]","arr["FIELD_D"]
}' data.txt
还不是爱你 2025-01-02 16:05:49

这样怎么样(假设文件名已知):

#!/usr/bin/perl
use strict;
use warnings;

my @f = qw(FIELD_A FIELD_B FIELD_C FIELD_D);
while(my $line = <DATA>) {
    chomp $line;
    my @r;
    for(@f) {
        if ($line =~ /$_:\s*([^,]+)/) {
            push @r, $1;
        } else {
            push @r,'';
        }
    }
    print join(',',@r), "\n";
}
__DATA__
SOME BUNCH OF TEXT, FIELD_A: 12, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656
SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS
SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS

输出:

12,0.2321,12:10:08 2011/07/22,656
12,0.2321,12:10:08 2011/07/22,656
12,,12:10:08 2011/07/22,656

How about this way (assuming fileds names are known) :

#!/usr/bin/perl
use strict;
use warnings;

my @f = qw(FIELD_A FIELD_B FIELD_C FIELD_D);
while(my $line = <DATA>) {
    chomp $line;
    my @r;
    for(@f) {
        if ($line =~ /$_:\s*([^,]+)/) {
            push @r, $1;
        } else {
            push @r,'';
        }
    }
    print join(',',@r), "\n";
}
__DATA__
SOME BUNCH OF TEXT, FIELD_A: 12, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656
SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_B: 0.2321, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS
SOME BUNCH OF TEXT, FIELD_A: 12, NOT_INTERESTED: 235, FIELD_C: 12:10:08 2011/07/22, FIELD_D: 656, FIELDS

output:

12,0.2321,12:10:08 2011/07/22,656
12,0.2321,12:10:08 2011/07/22,656
12,,12:10:08 2011/07/22,656
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文