How do you open and read a large file with Python?

Posted on 2025-02-06 05:47:25

The basic task is to write a function, get_words_from_file(filename), that returns a list of lowercase words that are within the region of interest. The assignment provides a regular expression, "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", that finds all words meeting this definition. My code works on some of the tests but fails on a larger file, so I think I'm opening the file incorrectly for bigger files.
I have no problem with these tests:

filename = "abc.txt"
words2 = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words2)))
print("Valid word list:")
print("\n".join(words2))

#or

filename = "synthetic.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)

#my code:
import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of interest:
    every word in the text file, but not any of the punctuation."""

    with open(filename, 'r', encoding='utf-8') as file:
        flag = False  # True while inside the region of interest
        words = []
        for line in file:
            if line.strip() == "*** START OF":
                flag = True
            elif line.strip() == "*** END ":
                flag = False
                break
            elif flag:
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+",
                                           new_line)
                words.extend(words_on_line)
        return words

Any help would be awesome!
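
A note on the likely cause: iterating over the file line by line, as the code above does, already copes with large files, so the way the file is opened is probably not the problem. A more likely culprit is the marker test, which compares the whole stripped line with "*** START OF" for equality. If the larger file's marker lines carry extra text after that prefix (an assumption, since the question does not show the file), the region of interest never starts and the function returns an empty list. Separately, strip() removes trailing whitespace, so a stripped line can never equal "*** END " with its trailing space. Below is a minimal sketch that matches on prefixes instead; the helper name get_words_from_lines and the two prefix constants are hypothetical, not part of the assignment.

```python
import re

# Pattern copied from the question.
WORD_PATTERN = re.compile(r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")

# Assumed marker prefixes; the real file's marker lines may differ.
START_PREFIX = "*** START OF"
END_PREFIX = "*** END"


def get_words_from_lines(lines):
    """Return lower-case words found between the start and end marker lines."""
    words = []
    in_region = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(START_PREFIX):
            # Marker line may continue with a title, so match the prefix only.
            in_region = True
        elif stripped.startswith(END_PREFIX):
            # Stop reading as soon as the end marker is reached.
            break
        elif in_region:
            words.extend(WORD_PATTERN.findall(line.lower()))
    return words


def get_words_from_file(filename):
    """Stream the file line by line, so memory use stays small for big files."""
    with open(filename, "r", encoding="utf-8") as file:
        return get_words_from_lines(file)
```

Splitting the line-processing logic from the file handling makes it possible to try the marker logic on a short list of strings before running it against the large file.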
