How do I loop through blocks of lines in a file?

Posted 2024-09-26 18:52:53

I have a text file that looks like this, with blocks of lines separated by blank lines:

ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23

ID: 3
Name: S
FamilyN: Y
Age: 13

ID: 4
Name: M
FamilyN: Z
Age: 25

How can I loop through the blocks and process the data in each block? Eventually I want to gather the name, family name and age values into three columns, like so:

Y X 20
F H 23
Y S 13
Z M 25

送你一个梦 2024-10-03 18:52:53

Here's another way, using itertools.groupby.
The function groupby iterates through the lines of the file and calls isa_group_separator(line) for each line. isa_group_separator returns either True or False (called the key), and itertools.groupby then groups all the consecutive lines that yielded the same True or False result.

This is a very convenient way to collect lines into groups.

import itertools

def isa_group_separator(line):
    return line=='\n'

with open('data_file') as f:
    for key,group in itertools.groupby(f,isa_group_separator):
        # print(key,list(group))  # uncomment to see what itertools.groupby does.
        if not key:               # however, this will make the rest of the code not work
            data={}               # as it exhausts the `group` iterator
            for item in group:
                field,value=item.split(':')
                value=value.strip()
                data[field]=value
            print('{FamilyN} {Name} {Age}'.format(**data))

# Y X 20
# F H 23
# Y S 13
# Z M 25
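
To see what groupby hands back before any parsing happens, here is a minimal sketch that runs it on an in-memory list instead of a file. The sample `lines` list is made up for illustration; the key function is the one from the answer above. Each run of consecutive lines with the same key becomes one group:

```python
import itertools

# Hypothetical sample input: two short records separated by a blank line.
lines = ["ID: 1\n", "Name: X\n", "\n", "ID: 2\n", "Name: H\n"]

def isa_group_separator(line):
    return line == "\n"

# Materialize each group immediately: a group iterator is only valid
# until groupby advances to the next key.
grouped = [(key, list(group))
           for key, group in itertools.groupby(lines, isa_group_separator)]

for key, group in grouped:
    print(key, group)
```

The groups whose key is False are the data blocks; the groups whose key is True hold only the separator lines, which is why the answer skips them.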

对你的占有欲 2024-10-03 18:52:53

Use a generator.

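def blocks(iterable):
    # start_pattern(line) is a placeholder predicate you supply: it returns
    # True for the line that marks a block boundary (here, a blank line).
    accumulator = []
    for line in iterable:
        if start_pattern(line):
            if accumulator:
                yield accumulator
                accumulator = []
        # elif other significant patterns
        else:
            accumulator.append(line)
    # Yield the final block, which has no boundary line after it.
    if accumulator:
        yield accumulator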
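
As a concrete instance of the generator idea for the file in the question, here is a rough sketch that treats any blank line as the block boundary and then formats each block. The `sample` string and the `rows` list are illustrative assumptions, not part of the original answer:

```python
def blocks(lines):
    """Yield lists of consecutive non-blank lines."""
    accumulator = []
    for line in lines:
        if line.strip():
            accumulator.append(line.rstrip("\n"))
        elif accumulator:
            # A blank line closes the current block.
            yield accumulator
            accumulator = []
    if accumulator:
        # Last block: the input ended without a trailing blank line.
        yield accumulator

# Hypothetical sample data in the format shown in the question.
sample = "ID: 1\nName: X\nFamilyN: Y\nAge: 20\n\nID: 2\nName: H\nFamilyN: F\nAge: 23\n"

rows = []
for block in blocks(sample.splitlines(keepends=True)):
    # Each line looks like "Key: value"; split on the first ": " only.
    data = dict(line.split(": ", 1) for line in block)
    rows.append("{FamilyN} {Name} {Age}".format(**data))

print("\n".join(rows))
```

Because `blocks` accepts any iterable of lines, an open file object could be passed in place of the split string, so the file would be read one line at a time.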

木有鱼丸 2024-10-03 18:52:53

import re
# subject is the full text of the file, read into a single string
result = re.findall(
    r"""(?mx)           # multiline, verbose regex
    ^ID:.*\s*           # Match ID: and anything else on that line
    Name:\s*(.*)\s*     # Match name, capture all characters on this line
    FamilyN:\s*(.*)\s*  # etc. for family name
    Age:\s*(.*)$        # and age""",
    subject)

The result will then be

[('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]

which can be trivially changed into whatever string representation you want.
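
For instance, the three-column layout from the question could be produced along these lines. This is only a sketch: `result` is copied from the output shown above, and `lines_out` is a name introduced here for illustration:

```python
# The list of (name, family name, age) tuples returned by re.findall above.
result = [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]

# The question wants the family name first, so swap the first two fields.
lines_out = [f"{family} {name} {age}" for name, family, age in result]

print("\n".join(lines_out))
```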

无戏配角 2024-10-03 18:52:53

If your file is too large to read into memory all at once, you can still use a regular expressions based solution by using a memory mapped file, with the mmap module:

import mmap
import re
import sys

# mmap exposes the file as bytes, so the pattern must be a bytes pattern too.
block_expr = re.compile(rb'ID:.*?\nAge: \d+', re.DOTALL)

filepath = sys.argv[1]
with open(filepath, 'rb') as fp:
    # A length of 0 maps the whole file.
    with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as contents:
        for block_match in block_expr.finditer(contents):
            print(block_match.group().decode())

The mmap trick provides a "pretend string" that lets regular expressions work on the file without having to read it all into one large string. The finditer() method of the regular expression object yields matches one at a time, without creating an entire list of all matches at once (which findall() does).

I do think this solution is overkill for this use case, however (still, it's a nice trick to know).

怂人 2024-10-03 18:52:53

If the file is not huge, you can read the whole file with:

content = open(filename).read()

Then you can split the content into blocks using:

blocks = content.split('\n\n')

Now you can write a function to parse a block of text. I would use split('\n') to get the lines of a block and split(':') to get the key and value, possibly with str.strip() or some help from regular expressions.

Without checking whether a block has the required data, the code can look like this:

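with open('data.txt', 'r') as f:
    content = f.read()
# strip() drops the trailing newline, which would otherwise leave an
# empty line in the last block and break the unpacking below
for block in content.strip().split('\n\n'):
    person = {}
    for l in block.split('\n'):
        k, v = l.split(': ')
        person[k] = v
    print('%s %s %s' % (person['FamilyN'], person['Name'], person['Age']))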
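
Splitting on the exact string '\n\n' assumes the separator lines are completely empty. If they might contain stray spaces, one possible variation is to split with a regular expression instead. The sketch below assumes a made-up `content` string whose separator line holds a single space:

```python
import re

# Hypothetical file content; note the separator line contains a space.
content = "ID: 1\nName: X\nFamilyN: Y\nAge: 20\n \nID: 2\nName: H\nFamilyN: F\nAge: 23\n"

# Split wherever a newline is followed by optional whitespace and another newline.
blocks = re.split(r"\n\s*\n", content.strip())

# Turn each block into a dict, splitting every line on the first ": " only.
people = [dict(line.split(": ", 1) for line in block.splitlines())
          for block in blocks]

for person in people:
    print(person["FamilyN"], person["Name"], person["Age"])
```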

难理解 2024-10-03 18:52:53

import itertools

# Assuming the input is in the file input.txt
with open('input.txt') as f:
    data = f.readlines()

# Keep only the groups of non-blank lines.
records = (lines for valid, lines in itertools.groupby(data, lambda l: l != '\n') if valid)
# Skip the ID line (islice from index 1) and keep the value after each colon.
output = [tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records]

# To get a generator instead, use this in place of the list above
# (not after it: records can only be consumed once):
# output = (tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records)

# output == [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]
# You can iterate and change the order of the elements however you want:
# [(elem[1], elem[0], elem[2]) for elem in output] gives the order required in your output

成熟的代价 2024-10-03 18:52:53

This answer isn't necessarily better than what's already been posted, but as an illustration of how I approach problems like this it might be useful, especially if you're not used to working with Python's interactive interpreter.

I've started out knowing two things about this problem. First, I'm going to use itertools.groupby to group the input into lists of data lines, one list for each individual data record. Second, I want to represent those records as dictionaries so that I can easily format the output.

One other thing that this shows is how using generators makes breaking a problem like this down into small parts easy.

>>> # first let's create some useful test data and put it into something 
>>> # we can easily iterate over:
>>> data = """ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23

ID: 3
Name: S
FamilyN: Y
Age: 13

ID: 4
Name: M
FamilyN: Z
Age: 25"""
>>> data = data.split("\n")
>>> # now we need a key function for itertools.groupby.
>>> # the key we'll be grouping by is, essentially, whether or not
>>> # the line is empty.
>>> # this will make groupby return groups whose key is True if we
>>> # care about them.
>>> def is_data(line):
        return bool(line.strip())

>>> # make sure this really works
>>> "\n".join([line for line in data if is_data(line)])
'ID: 1\nName: X\nFamilyN: Y\nAge: 20\nID: 2\nName: H\nFamilyN: F\nAge: 23\nID: 3\nName: S\nFamilyN: Y\nAge: 13\nID: 4\nName: M\nFamilyN: Z\nAge: 25'

>>> # does groupby return what we expect?
>>> import itertools
>>> [list(value) for (key, value) in itertools.groupby(data, is_data) if key]
[['ID: 1', 'Name: X', 'FamilyN: Y', 'Age: 20'], ['ID: 2', 'Name: H', 'FamilyN: F', 'Age: 23'], ['ID: 3', 'Name: S', 'FamilyN: Y', 'Age: 13'], ['ID: 4', 'Name: M', 'FamilyN: Z', 'Age: 25']]
>>> # what we really want is for each item in the group to be a tuple
>>> # that's a key/value pair, so that we can easily create a dictionary
>>> # from each item.
>>> def make_key_value_pair(item):
        items = item.split(":")
        return (items[0].strip(), items[1].strip())

>>> make_key_value_pair("a: b")
('a', 'b')
>>> # let's test this:
>>> dict(make_key_value_pair(item) for item in ["a:1", "b:2", "c:3"])
{'a': '1', 'b': '2', 'c': '3'}
>>> # we could conceivably do all this in one line of code, but this 
>>> # will be much more readable as a function:
>>> def get_data_as_dicts(data):
        for (key, value) in itertools.groupby(data, is_data):
            if key:
                yield dict(make_key_value_pair(item) for item in value)

>>> list(get_data_as_dicts(data))
[{'ID': '1', 'Name': 'X', 'FamilyN': 'Y', 'Age': '20'}, {'ID': '2', 'Name': 'H', 'FamilyN': 'F', 'Age': '23'}, {'ID': '3', 'Name': 'S', 'FamilyN': 'Y', 'Age': '13'}, {'ID': '4', 'Name': 'M', 'FamilyN': 'Z', 'Age': '25'}]
>>> # now for an old trick:  using a list of column names to drive the output.
>>> columns = ["Name", "FamilyN", "Age"]
>>> print("\n".join(" ".join(d[c] for c in columns) for d in get_data_as_dicts(data)))
X Y 20
H F 23
S Y 13
M Z 25
>>> # okay, let's package this all into one function that takes a filename
>>> def get_formatted_data(filename):
        with open(filename, "r") as f:
            columns = ["Name", "FamilyN", "Age"]
            for d in get_data_as_dicts(f):
                yield " ".join(d[c] for c in columns)

>>> print("\n".join(get_formatted_data("c:\\temp\\test_data.txt")))
X Y 20
H F 23
S Y 13
M Z 25

夏末染殇 2024-10-03 18:52:53

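Use a dict, a namedtuple, or a custom class to store each attribute as you come across it, then append the object to a list when you reach a blank line or EOF.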
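
One way this could look with a namedtuple is sketched below. The `Person` type, the `read_people` helper and the `sample` string are names made up for illustration, and the sketch assumes every block contains exactly the four fields from the question:

```python
from collections import namedtuple

# Assumed record layout, taken from the field names in the question.
Person = namedtuple("Person", ["ID", "Name", "FamilyN", "Age"])

def read_people(lines):
    """Collect one Person per block of non-blank "Key: value" lines."""
    people = []
    current = {}
    for line in lines:
        line = line.strip()
        if line:
            # Store each attribute as it is encountered.
            field, value = line.split(":", 1)
            current[field.strip()] = value.strip()
        elif current:
            # Blank line: the current record is complete.
            people.append(Person(**current))
            current = {}
    if current:
        # EOF reached without a trailing blank line.
        people.append(Person(**current))
    return people

sample = "ID: 1\nName: X\nFamilyN: Y\nAge: 20\n\nID: 2\nName: H\nFamilyN: F\nAge: 23\n"
people = read_people(sample.splitlines())
for p in people:
    print(p.FamilyN, p.Name, p.Age)
```

A block with a missing or unexpected field would make `Person(**current)` raise a TypeError, which may or may not be the behavior you want.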

要走干脆点 2024-10-03 18:52:53

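A simple solution:

result = []
# content is the full text of the file; strip() keeps a trailing newline
# from adding an empty fifth line to the last block
for record in content.strip().split('\n\n'):
    try:
        # keep the text after the first space on each of the four lines
        _id, name, familyn, age = (line.split(' ', 1)[1] for line in record.split('\n'))
    except (ValueError, IndexError):
        # skip blocks that don't have exactly four "Key: value" lines
        pass
    else:
        result.append((familyn, name, age))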

何以笙箫默 2024-10-03 18:52:53

Along with the half-dozen other solutions I already see here, I'm a bit surprised that no one has been so simple-minded (that is, generator-, regex-, map-, and read-free) as to propose, for example,

fp = open(fn)  # fn is the path to the data file

def get_one_value():
    line = fp.readline()
    if not line:
        return None  # end of file
    parts = line.split(':')
    if 2 != len(parts):
        return ''  # blank separator line
    return parts[1].strip()

# The result is supposed to be a list.
result = []
while True:
    # We don't care about the ID.
    if get_one_value() is None:
        break
    name = get_one_value()
    familyn = get_one_value()
    age = get_one_value()
    result.append((name, familyn, age))
    # We don't care about the block separator.
    if get_one_value() is None:
        break
fp.close()

for item in result:
    print(item)

Re-format to taste.
