Checking for a match in a txt file before writing to it with Python

Posted 2024-11-26 07:13:46

I am working with a very large text file (500MB+). My code produces the right output, but I am getting a lot of duplicates. What I want to do is check whether an entry has already been written to the output file before writing it. I am sure it is just one line in an if statement, but I do not know Python well and cannot figure out the syntax. Any help would be greatly appreciated.

Here is the code:

authorList = ['Shakes.','Scott']

with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        the_whole_file = open_file.read()
        for x in authorList:
            start_position = 0
            while True:
                start_position = the_whole_file.find('<A>'+x+'</A>', start_position)
                if start_position < 0:
                    break
                end_position = the_whole_file.find('</W>', start_position)
                output_file.write(the_whole_file[start_position:end_position+4])
                output_file.write("\n")
                start_position = end_position + 4

Comments (4)

花落人断肠 2024-12-03 07:13:46

I suggest that you simply keep track of which author data you have already seen, and only write it if you haven't seen it before. You can use a dict to keep track.

authorList = ['Shakes.','Scott']
already_seen = {}  # dict to keep track of what has been seen

with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        the_whole_file = open_file.read()
        for x in authorList:
            start_position = 0
            while True:
                start_position = the_whole_file.find('<A>'+x+'</A>', start_position)
                if start_position < 0:
                    break
                end_position = the_whole_file.find('</W>', start_position)
                if end_position < 0:
                    break  # no closing </W>: stop scanning for this author
                author_data = the_whole_file[start_position:end_position+4]
                # write the entry only the first time it is seen
                if author_data not in already_seen:
                    output_file.write(author_data + "\n")
                    already_seen[author_data] = True
                start_position = end_position + 4
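
As a rough sketch of the same idea, the scan can be wrapped in a function that works on a string and uses a set for the membership test. The function name `extract_unique`, its `end_tag` parameter, and the sample text are made up for illustration; they are not part of the original code.

```python
def extract_unique(text, authors, end_tag='</W>'):
    """Return records from <A>author</A> through end_tag, without duplicates."""
    seen = set()      # records already collected
    records = []      # unique records, in first-seen order
    for author in authors:
        start_tag = '<A>' + author + '</A>'
        pos = 0
        while True:
            start = text.find(start_tag, pos)
            if start < 0:
                break
            end = text.find(end_tag, start)
            if end < 0:
                break  # unterminated record: stop scanning for this author
            record = text[start:end + len(end_tag)]
            if record not in seen:
                seen.add(record)
                records.append(record)
            pos = end + len(end_tag)
    return records


# Made-up sample data, only to illustrate the behaviour.
sample = ('<A>Scott</A><W>Marmion</W>'
          '<A>Shakes.</A><W>Hamlet</W>'
          '<A>Scott</A><W>Marmion</W>')
unique_records = extract_unique(sample, ['Shakes.', 'Scott'])
print(unique_records)
```

Each returned record could then be written to the output file followed by "\n", as in the code above. Keeping the records in a list next to the set preserves the order in which they were first found.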

梦忆晨望 2024-12-03 07:13:46

Create a list holding every string you intend to write. Before you append an item, first check whether it is already in the list.
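
A minimal sketch of that suggestion might look like the following. The helper name `add_if_new` and the sample entries are hypothetical, chosen only to show the check-before-append step.

```python
def add_if_new(items, candidate):
    """Append candidate to items only if it is not already present."""
    if candidate not in items:  # linear scan of the list
        items.append(candidate)
        return True
    return False


# Made-up entries standing in for the extracted records.
to_write = []
for entry in ['<A>Scott</A>a</W>', '<A>Scott</A>b</W>', '<A>Scott</A>a</W>']:
    add_if_new(to_write, entry)

print(to_write)
```

Note that `in` on a list scans the whole list, so with a very large number of entries this check gets slow; a set or dict gives the same test with constant-time lookups on average.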

夏了南城 2024-12-03 07:13:46

My understanding is that, when writing to output_file, you want to skip the lines in open_file that contain the name of one of your authors. If that is what you intend, do it this way.

authorList = ['Shakes.','Scott']

with open('/Users/Adam/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Adam/Desktop/2e.txt','r') as open_file:
        for line in open_file:
            # write the line only if it mentions none of the authors
            if not any(author in line for author in authorList):
                output_file.write(line)

甜味拾荒者 2024-12-03 07:13:46

I think you should process your file with a tool suited to handling text: regular expressions.

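import re

# '.' does not match newlines by default; pass re.DOTALL if a record can span lines
regx = re.compile('<A>(.+?)</A>.*?<W>.*?</W>')

# text mode, so that chunks are str and can be combined with '' and '\n'
with open('/Users/Desktop/2e.txt','r') as open_file,\
     open('/Users/Desktop/Poetrylist.txt','w') as output_file:

    remain = ''   # unmatched tail carried over from the previous chunk
    seen = set()  # author names already written

    while True:
        chunk = open_file.read(65536) # 65536 == 16 x 16 x 16 x 16
        if not chunk:
            break
        buffer = remain + chunk
        last_end = 0  # end of the last match found in this buffer
        for mat in regx.finditer(buffer):
            if mat.group(1) not in seen:
                output_file.write(mat.group() + '\n')
                seen.add(mat.group(1))
            last_end = mat.end()
        # keep only the text after the last match; it may hold a partial record
        remain = buffer[last_end:]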
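
The snippet above stores the author name in `seen`, so it keeps one record per author. If the goal is instead to drop only exact duplicate records, one possible variant keys the set on the whole match. This is a sketch under assumptions: the function name `iter_unique_records`, the use of `re.DOTALL` (on the guess that a record may span several lines), the tiny chunk size, and the sample text are all illustrative and not taken from the original answer.

```python
import io
import re

# Assumption: records may span lines, hence re.DOTALL.
RECORD = re.compile(r'<A>(.+?)</A>.*?<W>.*?</W>', re.DOTALL)


def iter_unique_records(stream, chunk_size=65536):
    """Yield each distinct record found in a text stream read in chunks."""
    seen = set()   # full records already yielded
    remain = ''    # unmatched tail carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer = remain + chunk
        last_end = 0
        for mat in RECORD.finditer(buffer):
            record = mat.group()
            if record not in seen:
                seen.add(record)
                yield record
            last_end = mat.end()
        # Text after the last match may contain the start of the next record.
        remain = buffer[last_end:]


# Made-up sample data; a small chunk size forces records to straddle chunks.
sample = io.StringIO(
    '<A>Scott</A><W>Marmion</W>\n'
    '<A>Scott</A><W>Ivanhoe</W>\n'
    '<A>Scott</A><W>Marmion</W>\n'
)
records = list(iter_unique_records(sample, chunk_size=10))
print(records)
```

With a real file, the stream would be the object returned by `open(path, 'r')`, and each yielded record could be written out followed by '\n'.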