Checking for matches in a txt file before writing with Python
I am working with a very large text file (500MB+). The code I have produces the right output, but I am getting a lot of duplicates. What I want to do is check the output file to see whether a piece of output already exists before writing it. I am sure it is just one line in an if statement, but I do not know Python well and cannot figure out the syntax. Any help would be greatly appreciated.
Here is the code:
authorList = ['Shakes.','Scott']

with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        the_whole_file = open_file.read()
        for x in authorList:
            start_position = 0
            while True:
                start_position = the_whole_file.find('<A>'+x+'</A>', start_position)
                if start_position < 0:
                    break
                end_position = the_whole_file.find('</W>', start_position)
                output_file.write(the_whole_file[start_position:end_position+4])
                output_file.write("\n")
                start_position = end_position + 4
Comments (4)
I suggest that you simply keep track of which author data you have already seen, and only write it if you haven't seen it before. You can use a dict to keep track.
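A minimal sketch of the "remember what you have already seen" idea, assuming the whole input fits in memory as it does in the question. It uses a set rather than a dict, since only membership matters here (a dict keyed by record would work the same way). The function name `unique_records` and the sample string are hypothetical.

```python
def unique_records(text, authors, end_tag='</W>'):
    """Return each '<A>author</A>...</W>' record once, in first-seen order."""
    seen = set()      # records that have already been emitted
    records = []
    for author in authors:
        marker = '<A>' + author + '</A>'
        start = 0
        while True:
            start = text.find(marker, start)
            if start < 0:
                break
            end = text.find(end_tag, start)
            if end < 0:
                # No closing tag left: stop rather than slice a bogus range.
                break
            record = text[start:end + len(end_tag)]
            if record not in seen:
                seen.add(record)
                records.append(record)
            start = end + len(end_tag)
    return records


sample = '<A>Scott</A> one</W> junk <A>Scott</A> one</W> <A>Shakes.</A> two</W>'
print(unique_records(sample, ['Shakes.', 'Scott']))
```

Each returned record can then be written to output_file followed by a newline, as in the original loop.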
Create a list holding every string to write. Before appending an item, check whether it is already in the list.
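The list-based check could be sketched as follows. The helper name `append_if_new` and the sample records are hypothetical. Note that `in` on a list scans it from the start, so with a very large number of records a set is likely to be faster.

```python
def append_if_new(items, item):
    """Append item to items only if it is not already present."""
    if item not in items:   # linear scan of the list
        items.append(item)
        return True
    return False


to_write = []
for record in ['a</W>', 'b</W>', 'a</W>']:
    append_if_new(to_write, record)

# Join once at the end instead of writing inside the loop.
output_text = '\n'.join(to_write)
```

Once the loop has finished, output_text can be written to the output file in a single call.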
My understanding is that you want to skip the lines in open_file that contain the names of your authors when writing to output_file. If that is what you intend to do, then do it this way.
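A minimal sketch of that line-skipping reading, assuming the input is processed line by line and that a plain substring match on the author name is enough. The generator name `lines_without_authors` and the sample lines are hypothetical.

```python
def lines_without_authors(lines, authors):
    """Yield only the lines that mention none of the given authors."""
    for line in lines:
        if not any(author in line for author in authors):
            yield line


sample_lines = ['<A>Scott</A> text\n', 'plain line\n', 'by Shakes. again\n']
kept = list(lines_without_authors(sample_lines, ['Shakes.', 'Scott']))
```

Because a file object iterates line by line, open_file could be passed in directly, which avoids reading all 500MB into memory at once.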
I think you should process your file with a tool suited to handling text: regular expressions.
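One way the regular-expression approach might look, as a sketch under the assumption that every record starts with `<A>author</A>` and ends at the first following `</W>`, as in the question. The function name `find_unique_records` and the sample string are hypothetical. The set-based duplicate filter is added here to address the question and is not part of the regex itself.

```python
import re


def find_unique_records(text, authors):
    """Find '<A>author</A>...</W>' spans in one regex pass, dropping repeats."""
    # re.escape keeps characters such as the '.' in 'Shakes.' literal.
    names = '|'.join(re.escape(a) for a in authors)
    # Non-greedy '.*?' stops at the first '</W>'; DOTALL lets '.' span newlines.
    pattern = re.compile('<A>(?:' + names + ')</A>.*?</W>', re.DOTALL)
    seen = set()
    unique = []
    for match in pattern.finditer(text):
        record = match.group(0)
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


sample = '<A>Scott</A> one</W> x <A>Shakes.</A> two</W> <A>Scott</A> one</W>'
result = find_unique_records(sample, ['Shakes.', 'Scott'])
```

Unlike the original loop, this returns records in the order they appear in the file rather than grouped by author.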