How to remove partially duplicate lines with AWK?

Posted on 2024-08-08 03:50:06

I have files with this kind of duplicate line, where only the last field is different:

OST,0202000070,01-AUG-09,002735,6,0,0202000068,4520688,-1,0,0,0,0,0,55
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,5
ONE,0208076826,01-AUG-09,002332,316,3481.055935,0204330827,29150,200,0,0,0,0,0,55
OST,0202000068,01-AUG-09,003019,6,0,0202000071,4520690,-1,0,0,0,0,0,55

I need to remove the first occurrence of the line and leave the second one.

I've tried:

awk '!x[$0]++ {getline; print $0}' file.csv

but it's not working as intended, as it's also removing non-duplicate lines.

Comments (3)

没企图 2024-08-15 03:50:06
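#!/bin/awk -f
# Keeps the first line seen for each key (the line minus its last field).
# To keep the last line of each group instead, feed the input in reverse order.
{
    # match() returns the position of the last comma; keep what precedes it
    s = substr($0, 1, match($0, /,[^,]+$/) - 1)
    if (!seen[s]) {
        print $0
        seen[s] = 1
    }
}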
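
The script above keeps the first line of each group. One way to keep the last line instead, as the question asks, is to reverse the input, de-duplicate, and reverse it back. This is a minimal sketch, assuming GNU `tac` is available; the function name `keep_last_tac` is made up for illustration.

```shell
#!/bin/sh
# keep_last_tac [FILE...]: hypothetical helper, not part of the answer above.
# Reverses the input, keeps the first line per key, then restores the order.
keep_last_tac() {
    tac "$@" |
    awk '{
        key = $0
        sub(/,[^,]*$/, "", key)      # key = the line minus its last field
        if (!seen[key]++) print      # first hit in reversed input = last in original
    }' |
    tac
}
```

It could be called as `keep_last_tac file.csv`. Near-duplicates do not have to be adjacent for this to work, but the `seen` array grows with the number of distinct keys.
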
小耗子 2024-08-15 03:50:06

If your near-duplicates are always adjacent, you can just compare to the previous entry and avoid creating a potentially huge associative array.

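#!/bin/awk -f
{
    # Key: everything before the last comma (the line minus its last field)
    s = substr($0, 1, match($0, /,[^,]*$/) - 1)
    # A new key means the previous group has ended, so print its last line
    if (NR > 1 && s != prev) {
        print prev0
    }
    prev = s
    prev0 = $0
}
END {
    # Print the last line of the final group (skipped for empty input)
    if (NR > 0) {
        print prev0
    }
}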

Edit: Changed the script so it prints the last one in a group of near-duplicates (no tac needed).

傲鸠 2024-08-15 03:50:06

As a general strategy (I'm not much of an AWK pro despite taking classes with Aho) you might try:

  1. Concatenate all the fields except the last.
  2. Use this string as a key to a hash.
  3. Store the entire line as the value in the hash.
  4. When you have processed all lines, loop through the hash, printing out the values.

This isn't AWK-specific and I can't easily provide any sample code, but this is what I would first try.
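
The steps above can be sketched in AWK roughly as follows. This is only an illustration under stated assumptions, not the answerer's code: the function name `keep_last` and the arrays `line` and `order` are made up here, and the `order` array is an addition so that output follows the order in which keys were first seen (a plain `for (k in line)` loop visits keys in unspecified order).

```shell
#!/bin/sh
# keep_last [FILE...]: hypothetical helper illustrating the hash strategy.
# Prints one line per key, keeping the last line seen for each key.
keep_last() {
    awk '
    {
        key = $0
        sub(/,[^,]*$/, "", key)      # steps 1-2: all fields but the last form the key
        if (!(key in line))
            order[++n] = key         # remember the order keys first appeared in
        line[key] = $0               # step 3: later lines overwrite earlier ones
    }
    END {
        for (i = 1; i <= n; i++)     # step 4: print the stored lines
            print line[order[i]]
    }' "$@"
}
```

It could be called as `keep_last file.csv`. Unlike the adjacent-line comparison, this holds one line per distinct key in memory until the end of the input.
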
