Efficiently parsing large text files in Python?

Posted on 2024-12-15 16:53:37

I have a series of large, flat text files that I need to parse in order to insert them into a SQL database. Each record spans multiple lines and consists of about a hundred fixed-length fields. I am trying to figure out how to parse them efficiently without loading the entire file into memory.

Each record starts with a numeric "1" as the first character on a new line (though not every line that starts with "1" is a new record) and terminates many lines later with a series of 20 spaces. While each field is fixed-width, each record is variable-length because it may or may not contain several optional fields. So I've been using "...20 spaces...\n1" as a record delimiter.

I've been trying to work with something like this to process 1kb at a time:

def read_in_chunks(file_object, chunk_size):
    # Yield successive pieces of at most chunk_size characters until EOF.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

file = open('test.txt')
for piece in read_in_chunks(file, chunk_size=1024):
    pass  # Do stuff

However, the problem I'm running into is when a single record spans multiple chunks. Am I overlooking an obvious design pattern? This problem would seem to be somewhat common. Thanks!

Comments (1)

佞臣 2024-12-22 16:53:37
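def recordsFromFile(inputFile):
    # Yield one multi-line record at a time from an open text file.
    lines = []
    terminator = ' ' * 20
    for line in inputFile:
        # Lines read from a file keep their trailing newline, so strip it
        # before checking whether the previous line ended with 20 spaces.
        if (lines and line.startswith('1')
                and lines[-1].rstrip('\r\n').endswith(terminator)):
            yield ''.join(lines)
            lines = []
        lines.append(line)
    if lines:
        # Emit the final record, which has no following '1' line.
        yield ''.join(lines)

with open('test.txt') as inputFile:
    for record in recordsFromFile(inputFile):
        pass  # Do stuff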
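
Once a whole record is in hand, the "Do stuff" step is mostly slicing fixed-width fields out of each line. The sketch below is only an illustration: FIELD_LAYOUT, its three field names and offsets, and parse_fixed_width are all made up here, since the question does not give the real layout of its roughly one hundred fields.

```python
# Hypothetical layout: (field name, start offset, width).
# These three fields are invented for illustration only.
FIELD_LAYOUT = [
    ('record_type', 0, 1),
    ('customer_id', 1, 8),
    ('name', 9, 12),
]

def parse_fixed_width(line, layout=FIELD_LAYOUT):
    """Slice one line into a dict of stripped field values."""
    return {name: line[start:start + width].strip()
            for name, start, width in layout}

# Build a sample line that matches the made-up layout above.
sample_line = '1' + '00012345' + 'Ada Lovelace'
fields = parse_fixed_width(sample_line)
print(fields)
```

Keeping the layout in one table means optional fields can be handled by choosing a different layout per line type, rather than scattering slice offsets through the parsing code.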

BTW, file is a built-in name in Python 2 (it was removed in Python 3). It's bad style to rebind it.
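
If reading in fixed-size chunks is still preferred over iterating by line, the usual pattern is to keep the unfinished tail of each chunk in a buffer and prepend it to the next read. Here is a minimal sketch, assuming the delimiter is exactly 20 spaces, a newline, and then a "1", as described in the question. The names records_from_chunks and DELIMITER are made up for this example.

```python
import io

# Assumed record boundary: 20 spaces and a newline, followed by a '1'.
DELIMITER = ' ' * 20 + '\n'

def records_from_chunks(file_object, chunk_size=1024):
    """Yield complete records while reading chunk_size characters at a time."""
    buffer = ''
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        while True:
            # A boundary only counts once the next record's '1' is visible.
            idx = buffer.find(DELIMITER + '1')
            if idx == -1:
                break  # incomplete record: keep it and read more
            end = idx + len(DELIMITER)
            yield buffer[:end]
            buffer = buffer[end:]
    if buffer:
        yield buffer  # last record has no following '1'

# Small in-memory demonstration with a chunk size shorter than the delimiter.
rec1 = '1AAA\n' + 'BBB' + ' ' * 20 + '\n'
rec2 = '1CCC\n' + '1DD\n' + 'EEE' + ' ' * 20 + '\n'
records = list(records_from_chunks(io.StringIO(rec1 + rec2), chunk_size=7))
print(len(records))
```

For very long records this rescans the buffer on every read, so the line-based generator is likely the simpler choice unless the files lack reliable line breaks.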
