Remove duplicate lines from a txt file

Posted 2024-10-16 23:01:33


I am processing large text files (~20MB) containing data delimited by line.
Most data entries are duplicated and I want to remove these duplications to only keep one copy.

Also, to make the problem slightly more complicated, some entries are repeated with an extra bit of info appended. In this case I need to keep the entry containing the extra info and delete the older versions.

e.g.
I need to go from this:


BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS

to this:

JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS

NB. the final order doesn't matter.

What is an efficient way to do this?

I can use awk, Python or any standard Linux command-line tool.

Thanks.

Answers (8)

何以笙箫默 2024-10-23 23:01:33


How about the following (in Python):

prev = None
for line in sorted(open('file')):
    line = line.strip()
    if prev is not None and not line.startswith(prev):
        print(prev)
    prev = line
if prev is not None:
    print(prev)

If you find memory usage an issue, you can do the sort as a pre-processing step using Unix sort (which is disk-based) and change the script so that it doesn't read the entire file into memory.
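
For example, a minimal sketch of that pre-sorted variant (assuming the script is saved as, say, dedupe.py and invoked as sort file | python3 dedupe.py; the filename is just for illustration):

import sys

# Expects its input already sorted (e.g. by the external `sort` command),
# so it only ever keeps one previous line in memory.
prev = None
for line in sys.stdin:
    line = line.rstrip("\n")
    if prev is not None and not line.startswith(prev):
        print(prev)
    prev = line
if prev is not None:
    print(prev)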

自此以后,行同陌路 2024-10-23 23:01:33


awk '{x[$1 " " $2 " " $3] = $0} END {for (y in x) print x[y]}'

If you need to specify the number of columns for different files:

awk -v ncols=3 '
  {
    key = "";
    for (i=1; i<=ncols; i++) {key = key FS $i}
    if (length($0) > length(x[key])) {x[key] = $0}
  }
  END {for (y in x) print y "\t" x[y]}
'
剧终人散尽 2024-10-23 23:01:33


This variation on glenn jackman's answer should work regardless of the position of lines with extra bits:

awk '{idx = $1 " " $2 " " $3; if (length($0) > length(x[idx])) x[idx] = $0} END {for (idx in x) print x[idx]}' inputfile

Or

awk -v ncols=3 '
  {
    key = "";
    for (i=1; i<=ncols; i++) {key = key FS $i}
    if (length($0) > length(x[key])) x[key] = $0
  }
  END {for (y in x) print x[y]}
' inputfile
终难愈 2024-10-23 23:01:33


This or a slight variant should do:

from pprint import pprint

finalData = {}
for line in open('file'):            # any iterable of lines works here
    parts = line.split()
    key, extra = tuple(parts[0:3]), parts[3:]
    if key not in finalData or extra:
        finalData[key] = extra

pprint(finalData)

outputs:

{('BOB', '123', '1DB'): ['EXTRA', 'BITS'],
 ('DAVE', '789', '1DB'): [],
 ('JIM', '456', '3DB'): ['AX']}
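
If the deduplicated lines are needed back in their original one-line-per-entry format, a short loop along these lines would do (a sketch continuing from the finalData dict above; the output filename is arbitrary):

# Rejoin each key tuple with its extra fields and write one line per entry.
with open('newfile', 'w') as out:
    for key, extra in finalData.items():
        out.write(" ".join(key + tuple(extra)) + "\n")
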
梦里兽 2024-10-23 23:01:33


You'll have to define a function to split your line into important bits and extra bits, then you can do:

def split_extra(s):
    """Return a pair, the important bits and the extra bits."""
    ...  # up to you; one possible version is sketched below

data = {}
for line in open('file'):
    impt, extra = split_extra(line)
    existing = data.setdefault(impt, extra)
    if len(extra) > len(existing):
        data[impt] = extra

out = open('newfile', 'w')
for impt, extra in data.items():
    out.write(impt + extra)
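
For the three-column format in the question, one possible split_extra (an illustrative sketch, written so that impt + extra reproduces the original line and works with the loop above) could be:

def split_extra(s):
    """Return (important bits, extra bits) such that impt + extra rebuilds the line."""
    parts = s.rstrip("\n").split(" ", 3)            # first three fields form the key
    impt = " ".join(parts[:3])
    extra = (" " + parts[3] if len(parts) > 3 else "") + "\n"
    return impt, extra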


Since you need the extra bits the fastest way is to create a set of unique entries (sort -u will do) and then you must compare each entry against each other, e.g.

if x.startswith(y) and not y.startswith(x)


and just leave x and discard y.
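
A rough Python sketch of that pairwise check (illustrative only; it is quadratic in the number of unique entries, so best suited to the already-reduced output of sort -u):

def drop_prefixes(unique_lines):
    """Discard y whenever some other entry x extends it (x.startswith(y) but not vice versa)."""
    return [y for y in unique_lines
            if not any(x.startswith(y) and not y.startswith(x) for x in unique_lines)]

# e.g. drop_prefixes(["BOB 123 1DB", "BOB 123 1DB EXTRA BITS", "JIM 456 3DB AX"])
# -> ["BOB 123 1DB EXTRA BITS", "JIM 456 3DB AX"]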

终陌 2024-10-23 23:01:33


If you have Perl and want to keep only the last entry for each key (here the first whitespace-separated field):

perl -ne '@_ = split(/ /); $kw = shift(@_); $kws{$kw} = "@_"; END { foreach (sort keys %kws) { print "$_ $kws{$_}"; } }' file.txt > file.new.txt
如若梦似彩虹 2024-10-23 23:01:33


The function find_unique_lines will work for a file object or a list of strings.

def split_line(s):
    # Key on the first three fields; also return the extra fields and the original line.
    parts = s.strip().split(' ')
    return " ".join(parts[:3]), parts[3:], s

def find_unique_lines(f):
    result = {}
    for key, data, line in map(split_line, f):
        if data or key not in result:
            result[key] = line
    return result.values()

test = """BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS""".split('\n')

for line in find_unique_lines(test):
    print(line)

Output:
BOB 123 1DB EXTRA BITS
JIM 456 3DB AX
DAVE 789 1DB