How to merge only the unique lines from file_a to file_b?
This question has been asked here in one form or another, but not quite the thing I'm looking for. So, this is the situation I shall be having: I already have one file, named file_a, and I'm creating another file, file_b. file_a is always bigger than file_b in size. There will be a number of duplicate lines in file_b (hence, in file_a as well), but both files will have some unique lines. What I want to do is: copy/merge only the unique lines from file_a to file_b, and then sort the line order, so that file_b becomes the most up-to-date one with all the unique entries. Neither of the original files should be more than 10MB in size. What's the most efficient (and fastest) way I can do that?
I was thinking of something like this, which does the merging all right.
#!/usr/bin/env python
import os, time, sys

# Convert date/time to epoch
def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

# input files
o_file = "file_a"
c_file = "file_b"
n_file = [o_file, c_file]
m_file = "merged.file"

# open the merged file once, so the second input appends after the first
F = open(m_file, 'w')
for x in range(len(n_file)):
    P = open(n_file[x], "r")
    output = P.readlines()
    P.close()

    # Sort the output, order by 2nd last field
    #sp_lines = [ line.split('\t') for line in output ]
    #sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

    #for line in sp_lines:
    for line in output:
        if "group_" in line:
            F.write(line)
F.close()
But it:
- doesn't keep only the unique lines
- isn't sorted (by the next-to-last field)
- and introduces a 3rd file, i.e. m_file
Just a side note (long story short): I can't use sorted() here as I'm using v2.3, unfortunately. The input files look like this:
On 23/03/11 00:40:03
JobID Group.User Ctime Wtime Status QDate CDate
===================================================================================
430792 group_atlas.pltatl16 0 32 4 02/03/11 21:52:38 02/03/11 22:02:15
430793 group_atlas.atlas084 30 472 4 02/03/11 21:57:43 02/03/11 22:09:35
430794 group_atlas.atlas084 12 181 4 02/03/11 22:02:37 02/03/11 22:05:42
430796 group_atlas.atlas084 8 185 4 02/03/11 22:02:38 02/03/11 22:05:46
I tried to use cmp() to sort by the 2nd-last field but, I think, it doesn't work simply because of the first 3 header lines of the input files.
Can anyone please help? Cheers!!!
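(A note on the header problem before the updates: time.strptime() raises ValueError on anything that isn't a date, so the three header lines have to be skipped before splitting and sorting. A minimal 2.3-safe sketch, assuming tab-separated fields as in the scripts below:)

import time

def toEpoch(dt):
    return int(time.mktime(time.strptime(dt, '%d/%m/%y %H:%M:%S')))

f = open('file_a')
for i in range(3):      # drop the 'On ...', column-name and '====' lines
    f.readline()
rows = [ line.split('\t') for line in f if line.strip() ]
f.close()
rows.sort(lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])))   # 2.3 has no key= arg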
Update 1:
For future reference, as suggested by Jakob, here is the complete script. It worked just fine.
#!/usr/bin/env python
import os, time, sys
from sets import Set as set

def toEpoch(dt):
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # I want to discard the headers
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

# Input files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

print time.strftime('%H:%M:%S', time.localtime())

# Sorting the output, order by 2nd last field
sp_lines = [ line.split('\t') for line in app(o_file, c_file) ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

F = open(m_file, 'w')
print "No. of lines: ", len(sp_lines)
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()
It took about 2m:47s to finish for 145244 lines.
[testac1@serv07 ~]$ ./uniq-merge.py
17:19:21
No. of lines: 145244
17:22:08
Thanks!!
Update 2:
Hi eyquem, this is the error message I get when I run your scripts.
From the first script:
[testac1@serv07 ~]$ ./uniq-merge_2.py
File "./uniq-merge_2.py", line 44
fm.writelines( '\n'.join(v)+'\n' for k,v in output )
^
SyntaxError: invalid syntax
From the second script:
[testac1@serv07 ~]$ ./uniq-merge_3.py
File "./uniq-merge_3.py", line 24
output = sett(line.rstrip() for line in fa)
^
SyntaxError: invalid syntax
Cheers!!
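(For the record: both SyntaxErrors point at generator expressions, which only arrived in Python 2.4, and output.sort(key=kl) needs 2.4 as well - in 2.3 the key argument would fail at runtime. Hypothetical 2.3-safe equivalents of the offending lines, with a minimal kl() assumed and the headers already stripped from file_a:)

from sets import Set as sett
from time import mktime, strptime

def kl(line):
    # epoch time of the next-to-last tab-separated field
    return mktime(strptime(line.split('\t')[-2], '%d/%m/%y %H:%M:%S'))

fa = open('file_a')
output = sett([ line.rstrip() for line in fa ])     # list comprehension, not a genexp
fa.close()

lines = list(output)
lines.sort(lambda a, b: cmp(kl(a), kl(b)))          # cmp-style sort instead of key=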
Update 3:
The previous one wasn't sorting the list at all; thanks to eyquem for pointing that out. Well, it does now. This is a further modification of Jakob's version - I converted the set from app(path1, path2) into a list, myList, and then applied sort( lambda ... ) to myList to sort the merged file by the next-to-last field. This is the final script.
#!/usr/bin/env python
import os, time, sys
from sets import Set as set

def toEpoch(dt):
    # Convert date/time to epoch
    dt_ptrn = '%d/%m/%y %H:%M:%S'
    return int(time.mktime(time.strptime(dt, dt_ptrn)))

def yield_lines(fileobj):
    # Discard the headers (1st 3 lines)
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    # Remove duplicate lines
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)

print time.strftime('%H:%M:%S', time.localtime())

# I/O files
o_file = "testScript/03"
c_file = "03.bak"
m_file = "finished.file"

# Convert the set into a list
myList = list(app(o_file, c_file))

# Sort the list by the date
sp_lines = [ line.split('\t') for line in myList ]
sp_lines.sort( lambda a, b: cmp(toEpoch(a[-2]), toEpoch(b[-2])) )

F = open(m_file, 'w')
print "No. of lines: ", len(sp_lines)

# Finally write to the outFile
for line in sp_lines:
    MF = '\t'.join(line)
    F.write(MF)
F.close()
There is no speed boost at all - it took 2m:50s to process the same 145244 lines. If anyone sees any scope for improvement, please let me know. Thanks to Jakob and eyquem for their time. Cheers!!
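(In hindsight the 2m:50s is easy to explain: the lambda a, b: cmp(...) comparator calls toEpoch twice per comparison, i.e. on the order of n·log(n) strptime() calls. The decorate-sort-undecorate pattern, which Update 4 below ends up using, computes each key exactly once. A sketch reusing toEpoch and myList from the script above, keeping whole lines instead of split fields:)

# Decorate each line with its epoch key, sort the tuples, then drop the key.
decorated = [ (toEpoch(line.split('\t')[-2]), line) for line in myList ]
decorated.sort()                          # tuples compare on the key first
sorted_lines = [ line for (t, line) in decorated ]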
Update 4:
Just for future reference, this is a modified version of eyquem's, which works much better and faster than the previous ones.
#!/usr/bin/env python
import os, sys, re
from sets import Set as sett
from time import mktime, strptime, strftime

def sorting_merge(o_file, c_file, m_file):
    # RegEx for the date/time field
    pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

    def kl(lines, pat=pat):
        # match only the next-to-last field
        line = lines.split('\t')
        line = line[-2]
        return mktime(strptime((pat.search(line).group()), '%d/%m/%y %H:%M:%S'))

    output = sett()
    head = []

    # Separate the header & remove the duplicates
    def rmHead(f_n):
        f_n.readline()
        for line1 in f_n:
            if pat.search(line1): break
            else: head.append(line1)  # line of the header
        for line in f_n:
            output.add(line.rstrip())
        output.add(line1.rstrip())
        f_n.close()

    fa = open(o_file, 'r')
    rmHead(fa)
    fb = open(c_file, 'r')
    rmHead(fb)

    # Sorting date-wise
    output = [ (kl(line), line.rstrip()) for line in output if line.rstrip() ]
    output.sort()

    fm = open(m_file, 'w')
    # Write to the file & add the header
    fm.write(strftime('On %d/%m/%y %H:%M:%S\n') + (''.join(head[0] + head[1])))
    for t, line in output:
        fm.write(line + '\n')
    fm.close()

c_f = "03_a"
o_f = "03_b"
sorting_merge(o_f, c_f, 'outfile.txt')
This version is much faster - 6.99 sec. for 145244 lines, compared to 2m:47s for the previous one using lambda a, b: cmp(). Thanks to eyquem for all his support. Cheers!!
Comments (4)
EDIT 2

My previous codes had problems with output = sett(line.rstrip() for line in fa) and with output.sort(key=kl). Moreover, they had some complications.

So I examined the choice of reading the files directly with the set() function, as Jakob Bowyer does in his code. Congratulations Jakob! (and Michal Chruszcz, by the way): set() is unbeatable, it's faster than reading one line at a time. So I abandoned my idea of reading the files line after line.

But I kept my idea of avoiding a sort that relies on the cmp() function because, as the doc describes, a comparison function is called many times over during a sort. Instead, I managed to obtain a list of tuples (t, line), in which t is the epoch time taken from the line's date field, built by a single instruction; see the excerpt below.
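(That instruction survives in the modified version in Update 4 of the question, with pat and kl() as defined there:)

output = [ (kl(line), line.rstrip()) for line in output if line.rstrip() ]
output.sort()    # tuples compare on t first, so no cmp() callback is needed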
I tested 2 codes: a first one in which the date-and-hour in a line is extracted with a regex, and a second one in which kl() takes the next-to-last field by a plain split; see the reconstruction below.
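(Both listings were lost from this copy of the post. Judging from the kl() preserved in Update 4 of the question and from the script names "merging regex.py" and "merging split.py" further down, the two variants presumably differed like this - a reconstruction, not the original code:)

import re
from time import mktime, strptime

pat = re.compile('[0123]\d/[01]\d/\d{2} [012]\d:[0-6]\d:[0-6]\d')

def kl_regex(line, pat=pat):
    # key = the date-and-hour found in the next-to-last field via regex
    return mktime(strptime(pat.search(line.split('\t')[-2]).group(),
                           '%d/%m/%y %H:%M:%S'))

def kl_split(line):
    # key = the next-to-last tab-separated field, no regex at all
    return mktime(strptime(line.split('\t')[-2], '%d/%m/%y %H:%M:%S'))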
This algorithm turned out to be faster than a code using the cmp() function: one in which the set of lines output isn't transformed into a list of tuples, but only into a list of the lines (then without duplicates), and is sorted with a comparison function mycmp() (see the doc). That version has the longer execution time of the two.

This time, I hope it will run correctly, and the only thing left to do is to wait for the execution times on real files, much bigger than the ones on which I tested the codes.
EDIT 3
EDIT 4
Maybe something along these lines?
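(The code block itself was lost from this copy of the answer; Update 1 of the question, which adopts this suggestion, preserves what was presumably its core:)

from sets import Set as set

def yield_lines(fileobj):
    # skip the three header lines, then yield the data lines
    for i in xrange(3):
        fileobj.readline()
    for line in fileobj:
        yield line

def app(path1, path2):
    file1 = set(yield_lines(open(path1)))
    file2 = set(yield_lines(open(path2)))
    return file1.union(file2)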
EDIT: Forgot about with :$
I wrote this new code, with the ease of using a set. It is faster than my previous code - and, it seems, faster than your code. Note that this code puts a heading in the merged file; see the sketch below.
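(This listing was lost as well; its heading logic survives in the modified version in Update 4 of the question, roughly as follows, with head standing for the header lines collected from the input files:)

from time import strftime

head = ['JobID\tGroup.User\tCtime\tWtime\tStatus\tQDate\tCDate\n',
        '=' * 83 + '\n']      # header lines, as collected from the input files
fm = open('merged.file', 'w')
# a fresh timestamp line first, then the two original header lines
fm.write(strftime('On %d/%m/%y %H:%M:%S\n') + ''.join(head[0] + head[1]))
fm.close()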
EDIT
Aaaaaah... I got it... :-))
Execution time divided by 3!
Last codes, I hope.
Because I found a killer code.
First, I created two files, "xxA.txt" and "yyB.txt", of 30000 lines, with the following code:

create AB.py
Then I ran these 3 programs:
"merging regex.py"
"merging split.py"
"merging killer"
The results:

With the killer program, sorting output takes 4 times longer, but the time to create output as a list is divided by 21! So globally, the execution time is reduced by at least 85%.
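(The create AB.py listing was lost from this copy. A hypothetical stand-in that produces two overlapping files in the sample format of the question - the file names come from the text, every other detail is assumed:)

#!/usr/bin/env python
# Hypothetical replacement for the lost "create AB.py": builds two overlapping
# test files in the job-accounting format shown in the question.
import random

def make_line(jobid):
    qdate = '02/03/11 21:%02d:%02d' % (random.randint(0, 59), random.randint(0, 59))
    cdate = '02/03/11 22:%02d:%02d' % (random.randint(0, 59), random.randint(0, 59))
    return '%d\tgroup_atlas.atlas%03d\t%d\t%d\t4\t%s\t%s\n' % (
        jobid, random.randint(0, 999), random.randint(0, 60),
        random.randint(0, 600), qdate, cdate)

header = ('On 23/03/11 00:40:03\n'
          'JobID\tGroup.User\tCtime\tWtime\tStatus\tQDate\tCDate\n'
          + '=' * 83 + '\n')

lines = [ make_line(430000 + i) for i in xrange(30000) ]

# about 10000 lines shared between the two files, the rest unique to each
for name, chunk in (('xxA.txt', lines[:20000]), ('yyB.txt', lines[10000:])):
    f = open(name, 'w')
    f.write(header)
    f.writelines(chunk)
    f.close()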