How do you open and read a large file with Python?

Posted on 2025-02-06 05:47:25

The basic task is to write a function, get_words_from_file(filename), that returns a list of lower-case words that are within the region of interest. We are given a regular expression, "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", that finds all words meeting this definition. My code works on some of the tests but fails on a larger file, so I think I'm opening the file incorrectly for bigger files.
I have no problem with these tests:

filename = "abc.txt"
words2 = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words2)))
print("Valid word list:")
print("\n".join(words2))

#or

filename = "synthetic.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)

#my code:
import re

def get_words_from_file(filename):
    """Returns a list of lower case words that are within the region of interest: every
    word in the text file, but not any of the punctuation."""

    with open(filename, 'r', encoding='utf-8') as file:
        flag = False
        words = []
        for line in file:
            if(str(line).strip()=="*** START OF"):
                flag=True
            elif(str(line).strip()=="*** END "):
                flag=False
                break       
            elif(flag):
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", 
                                           new_line)
                words.extend(words_on_line)
        return words

Any help would be awesome!
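
A note on the likely cause: iterating over the file object with `for line in file` already reads one line at a time, so file size on its own should not be the problem. Two things in the comparisons stand out instead. First, `strip()` removes trailing whitespace, so a stripped line can never equal "*** END " (which ends in a space), and the end marker is never detected. Second, the exact `==` test against "*** START OF" only succeeds when the marker line contains nothing else. The sketch below assumes the larger file's marker lines carry extra text after those prefixes (for example a title), which the question does not confirm; the names `WORD_RE`, `START_MARKER` and `END_MARKER` are introduced here for illustration only.

```python
import re

# Regex from the question, compiled once so it is not re-parsed for every line.
WORD_RE = re.compile(r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+")

# Assumed marker prefixes; the real files may use different text.
START_MARKER = "*** START OF"
END_MARKER = "*** END"


def get_words_from_file(filename):
    """Return the lower-case words found between the start and end marker lines."""
    words = []
    in_region = False
    with open(filename, "r", encoding="utf-8") as file:
        for line in file:  # lazy iteration: only one line is held at a time
            stripped = line.strip()
            if stripped.startswith(START_MARKER):
                # Prefix match, so trailing text on the marker line is tolerated.
                in_region = True
            elif stripped.startswith(END_MARKER):
                # Stop reading as soon as the region of interest ends.
                break
            elif in_region:
                words.extend(WORD_RE.findall(line.lower()))
    return words
```

If the marker lines in the test files really are exactly "*** START OF" and "*** END", the prefix match still accepts them, so this version should behave the same on the small tests while also handling longer marker lines.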
