Using awk to delete lines that contain a unique first field?

Posted 2024-10-19 05:06:01

Looking to print only lines that have a duplicate first field. e.g. from data that looks like this:

1 abcd
1 efgh
2 ijkl
3 mnop
4 qrst
4 uvwx

Should print out:

1 abcd
1 efgh
4 qrst
4 uvwx

(FYI - first field is not always 1 character long in my data)

Comments (5)

对岸观火 2024-10-26 05:06:01
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile

Yes, you give it the same file as input twice. Since you don't know ahead of time whether the current record is unique or not, you build up an array keyed on $1 during the first pass, then on the second pass you only output records whose $1 was seen more than once.

I'm sure there are ways to do it with only a single pass through the file, but I doubt they would be as "clean".

Explanation

  1. FNR==NR: This is only true while awk is reading the first file. It essentially tests the total number of records seen (NR) against the record number within the current file (FNR).
  2. a[$1]++: Build an associative array a whose key is the first field ($1) and whose value is incremented by one each time that field is seen.
  3. next: If this is reached, skip the rest of the script and start over with the next input record.
  4. (a[$1] > 1): This is only evaluated on the second pass over ./infile, and it prints only the records whose first field ($1) was seen more than once. Essentially, it is shorthand for if (a[$1] > 1) { print $0 }.

Proof of Concept

$ cat ./infile
1 abcd
1 efgh
2 ijkl
3 mnop
4 qrst
4 uvwx

$ awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
1 abcd
1 efgh
4 qrst
4 uvwx
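
For comparison, a minimal one-pass sketch of the idea mentioned above is possible (this variant is only an illustration, not the command from this answer): buffer the first line seen for each key and flush it as soon as a second line with the same first field arrives. With input that is not grouped by the first field, the printed order can differ from the two-pass version, since a group's first line is only emitted once its duplicate shows up.

awk '{
  count[$1]++                      # times this first field has been seen so far
  if (count[$1] == 1) {
    saved[$1] = $0                 # hold the first occurrence of this key
  } else {
    if (saved[$1] != "") {         # flush the held first occurrence exactly once
      print saved[$1]
      saved[$1] = ""
    }
    print                          # second and later occurrences print immediately
  }
}' ./infile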
审判长 2024-10-26 05:06:01

Here is some awk code to do what you want, assuming the input is grouped by its first field already (like uniq also requires):

BEGIN {f = ""; l = ""}
{
  if ($1 == f) {
    if (l != "") {
      print l
      l = ""
    }
    print $0
  } else {
    f = $1
    l = $0
  }
}

In this code, f is the previous value of field 1 and l is the first line of the group (or empty if that has already been printed out).
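
Assuming the program above is saved to a file, say printdups.awk (the name is just for illustration), it can be run against the sample data from the question like this:

$ awk -f printdups.awk ./infile
1 abcd
1 efgh
4 qrst
4 uvwx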

耳钉梦 2024-10-26 05:06:01
# Single-pass state machine: assumes lines with the same first field are adjacent.
BEGIN { IDLE = 0; DUP = 1; state = IDLE }

{
  if (state == IDLE) {
    if ($1 == lasttime) {
      state = DUP            # first field repeats: a duplicate group has started
      print lastline         # emit the held line that opened the group
    } else state = IDLE
  } else {
    if ($1 != lasttime)      # first field changed: the duplicate group has ended
      state = IDLE
  }
  if (state == DUP)
    print $0                 # inside a duplicate group, print the current line
  lasttime = $1              # remember this record's key and text for the next one
  lastline = $0
}
戴着白色围巾的女孩 2024-10-26 05:06:01

Assuming ordered input as you show in your question:

awk '$1 == prev {if (prevline) print prevline; print $0; prevline=""; next} {prev = $1; prevline=$0}' inputfile

The file only needs to be read once.
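
For reference, with inputfile holding the sample data from the question, this should produce:

$ awk '$1 == prev {if (prevline) print prevline; print $0; prevline=""; next} {prev = $1; prevline=$0}' inputfile
1 abcd
1 efgh
4 qrst
4 uvwx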

栀子花开つ 2024-10-26 05:06:01

If you can use Ruby (1.9+):

#!/usr/bin/env ruby
hash = Hash.new{|h,k|h[k] = []}        # default each key to an empty array
File.open("file").each do |x|
  a,b=x.split(/\s+/,2)                 # a = first field, b = rest of the line
  hash[a] << b                         # group the rest of each line under its key
end
# print only the groups that collected more than one line
hash.each{|k,v| hash[k].each{|y| puts "#{k} #{y}" } if v.size>1 }

output:

$ cat file
1 abcd
1 efgh
2 ijkl
3 mnop
4 qrst
4 uvwx
4 asdf
1 xzzz

$ ruby arrange.rb
1 abcd
1 efgh
1 xzzz
4 qrst
4 uvwx
4 asdf