Program control flow in Python

Posted on 2024-07-17 03:53:56

I have some data stored in a list. If I print the list, I see the following:

.
.
.
007 A000000 Y
007 B000000  5
007 C010100  1
007 C020100 ACORN FUND
007 C030100 N
007 C010200  2
007 C020200 ACORN INTERNATIONAL
007 C030200 N
007 C010300  3
007 C020300 ACORN USA
007 C030300 N
007 C010400  4
.
.
.

The dots before and after the sequence indicate that there is other data that is similarly structured but might or might not be part of this seventh item (007). If the first value in the seventh item is '007 A000000 Y', then I want to create a dictionary listing some of the data items. I can do this, and have done so, by running through all of the items in my list and comparing their values to some test values for the variables. For instance, a line of code like:

if dataLine.find('007 B')==0:
    numberOfSeries=int(dataLine.split()[2])

What I want to do, though, is:

if dataLine.find('007 A000000 Y')==0:
    READ THE NEXT LINE RIGHT HERE

Right now I have to iterate through the entire list for each cycle.

I want to shorten the processing time because I have about 60K files, each with between 500 and 5,000 lines.

I have thought about creating another reference to the list and counting the data lines until dataLine.find('007 A000000 Y')==0, but that does not seem like the most elegant solution.

Comments (5)

柠檬心 2024-07-24 03:53:56

You can use itertools.groupby() to segment your sequence into multiple sub-sequences.

import itertools

for key, subseq in itertools.groupby(tempans, lambda s: s.partition(' ')[0]):
    if key == '007':
        for dataLine in subseq:
            if dataLine.startswith('007 B'):
                numberOfSeries = int(dataLine.split()[2])

itertools.dropwhile() would also work if you really just want to seek up to that line:

>>> list(itertools.dropwhile(lambda s: s != '007 A000000 Y', tempans))
['007 A000000 Y',
 '007 B000000  5',
 '007 C010100  1',
 '007 C020100 ACORN FUND',
 '007 C030100 N',
 '007 C010200  2',
 '007 C020200 ACORN INTERNATIONAL',
 '007 C030200 N',
 '007 C010300  3',
 '007 C020300 ACORN USA',
 '007 C030300 N',
 '007 C010400  4',
 '.',
 '.',
 '.',
 '']
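
As a rough sketch of how the two ideas above might be combined, dropwhile() can skip ahead to the marker line and takewhile() can stop at the first line that belongs to another item. The sample list and the helper name item_block are made up for illustration; they are not from the original post.

```python
from itertools import dropwhile, takewhile

# Hypothetical sample standing in for the asker's list.
tempans = [
    '006 Z000000 N',
    '007 A000000 Y',
    '007 B000000  5',
    '007 C010100  1',
    '008 A000000 N',
]


def item_block(lines, marker='007 A000000 Y', prefix='007 '):
    """Lazily yield lines from the marker up to the first line with another prefix."""
    from_marker = dropwhile(lambda s: s != marker, lines)
    return takewhile(lambda s: s.startswith(prefix), from_marker)


block = list(item_block(tempans))
print(block)
```

Both functions are lazy, so nothing past the end of the block is examined. If the marker never appears, the result is simply empty.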
蒗幽 2024-07-24 03:53:56

You could read the data into a dictionary. Assuming you are reading from a file-like object infile:

from collections import defaultdict
data = defaultdict(list)
for line in infile:
    elements = line.strip().split()
    data[elements[0]].append(tuple(elements[1:]))

Now if you want to read the line after '007 A000000 Y', you can do so as:

# find the index of ('A000000', 'Y')
idx = data['007'].index(('A000000', 'Y'))
# get the next line
print(data['007'][idx+1])
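
Here is a small self-contained sketch of the same idea that can be run without a real file. It assumes Python 3, uses io.StringIO as a stand-in for infile, and adds a hypothetical helper, record_after, that returns None instead of raising when the marker is missing or is the last record.

```python
import io
from collections import defaultdict

# Hypothetical sample; a real run would pass an open file instead.
infile = io.StringIO(
    "007 A000000 Y\n"
    "007 B000000  5\n"
    "\n"
    "008 A000000 N\n"
)

data = defaultdict(list)
for line in infile:
    elements = line.split()
    if not elements:  # skip blank lines, which would break elements[0]
        continue
    data[elements[0]].append(tuple(elements[1:]))


def record_after(records, target):
    """Return the record following target, or None if target is absent or last."""
    try:
        idx = records.index(target)
    except ValueError:
        return None
    return records[idx + 1] if idx + 1 < len(records) else None


next_record = record_after(data['007'], ('A000000', 'Y'))
print(next_record)
```

One thing to keep in mind is that split() also breaks multi-word values apart, so a line such as '007 C020100 ACORN FUND' is stored as ('C020100', 'ACORN', 'FUND').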
零時差 2024-07-24 03:53:56

The only difficulty with holding all the data in a dictionary is that a really big dictionary can become troublesome. (It's what we used to call the "Big Ole Matrix" approach.)

A solution is to build an index in the dictionary instead: a mapping of key->offset, using the file's tell method to get each offset. You can then return to a line later by jumping to its offset with the seek method.
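
A minimal sketch of such an index might look like the following. It assumes Python 3, that the first two fields of a line are a suitable key, and that the files are ASCII; build_index and read_line_at are hypothetical names. The file is opened in binary mode and read with readline() so that tell() reports plain byte offsets.

```python
import os
import tempfile


def build_index(path):
    """Map (item, code) byte-string pairs to the offset where their line starts."""
    index = {}
    with open(path, 'rb') as f:
        while True:
            offset = f.tell()        # position before reading the line
            line = f.readline()
            if not line:             # empty bytes means end of file
                break
            fields = line.split()
            if len(fields) >= 2:
                index[(fields[0], fields[1])] = offset
    return index


def read_line_at(path, offset):
    """Seek to a stored offset and return the line found there as text."""
    with open(path, 'rb') as f:
        f.seek(offset)
        return f.readline().decode('ascii').rstrip('\r\n')


# Demo on a throwaway file with made-up contents.
with tempfile.NamedTemporaryFile('wb', suffix='.dat', delete=False) as tmp:
    tmp.write(b"007 A000000 Y\n007 B000000  5\n007 C010100  1\n")
    demo_path = tmp.name

index = build_index(demo_path)
line = read_line_at(demo_path, index[(b'007', b'B000000')])
os.remove(demo_path)
print(line)
```

Only the keys and integer offsets stay in memory, so the dictionary remains small even when the lines themselves are long.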

留一抹残留的笑 2024-07-24 03:53:56

You said you wanted to do this:

if dataLine.find('007 A000000 Y')==0:
    READ THE NEXT LINE RIGHT HERE

Presumably this is within a "for dataLine in data" loop.

Alternatively, you could use an iterator directly instead of in a for loop:

>>> i = iter(data)
>>> while next(i) != '007 A000000 Y': pass  # find your starting line
>>> next(i)  # read the next line
'007 B000000  5'
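
The same trick also works inside an ordinary for loop, as long as the loop runs over an explicit iterator: calling the built-in next() on that iterator consumes the following line on the spot. The sketch below assumes Python 3; the function name series_count and the sample list are invented for illustration.

```python
def series_count(lines):
    """Return the count on the '007 B' line that follows the marker, else None."""
    it = iter(lines)
    for data_line in it:
        if data_line.startswith('007 A000000 Y'):
            following = next(it, None)  # None if the marker was the last line
            if following is not None and following.startswith('007 B'):
                return int(following.split()[2])
            return None
    return None


# Hypothetical sample data.
data = [
    '006 Z000000 N',
    '007 A000000 Y',
    '007 B000000  5',
    '007 C010100  1',
]
print(series_count(data))
```

Passing a default to next() avoids a StopIteration when the marker happens to be the final line, and the function returns as soon as the marker is handled, so the rest of the list is never scanned.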

You also mention having 60K files to process. Are they all formatted similarly? Do they need to be processed differently? If they can all be processed the same way, you could consider chaining them together in a single flow:

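import fnmatch
import os

def gfind( directory, pattern="*" ):
    # yield the full path of every name in directory that matches pattern
    for name in fnmatch.filter( os.listdir( directory ), pattern ):
        yield os.path.join( directory, name )

def gopen( names ):
    # open files one at a time; 'rb' yields bytes lines on Python 3, use 'r' for str
    for name in names:
        yield open(name, 'rb')

def gcat( files ):
    # concatenate the lines of all files into a single stream
    for file in files:
        for line in file:
            yield line

# raw string so the backslash in the Windows path is taken literally
data = gcat( gopen( gfind( r'C:\datafiles', '*.dat' ) ) )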

This lets you lazily process all your files in a single iterator. Not sure if that helps your current situation but I thought it worth mentioning.
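
With tens of thousands of files it may also be worth closing each file as soon as it has been read. The variant below is only a sketch under that assumption: it targets Python 3, reads in text mode, strips the trailing newline, and uses invented names (find_files, read_lines) and a temporary directory with made-up contents for the demo.

```python
import fnmatch
import os
import tempfile


def find_files(directory, pattern='*'):
    """Yield matching paths in sorted order so runs are repeatable."""
    for name in sorted(fnmatch.filter(os.listdir(directory), pattern)):
        yield os.path.join(directory, name)


def read_lines(paths):
    """Yield every line of every file, closing each file when it is exhausted."""
    for path in paths:
        with open(path) as handle:
            for line in handle:
                yield line.rstrip('\n')


# Demo with throwaway files.
with tempfile.TemporaryDirectory() as demo_dir:
    samples = [
        ('a.dat', '007 A000000 Y\n'),
        ('b.dat', '007 B000000  5\n'),
        ('skip.txt', 'ignored\n'),
    ]
    for name, text in samples:
        with open(os.path.join(demo_dir, name), 'w') as out:
            out.write(text)
    all_lines = list(read_lines(find_files(demo_dir, '*.dat')))

print(all_lines)
```

The pipeline is still lazy: at most one file is open at any moment, and nothing is read until the consumer asks for the next line.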

桃扇骨 2024-07-24 03:53:56

Okay, while I was Googling to make sure I had covered my bases, I came across a solution:

I find that I forget to think in lists and dictionaries even though I use them. Python has some powerful tools for working with these types that make them much easier to manipulate.
I need a slice, and the slice boundaries are easily obtained by

beginPosit = tempans.index('007 A000000 Y')
endPosit = min([i for i, item in enumerate(tempans) if '008 ' in item])

where tempans is the data list.
Now I can write

for line in tempans[beginPosit:endPosit]:
    pass  # process each line here

I think I answered my own question. I learned a lot from the other answers and appreciate them, but I think this is what I needed.

Okay, I am going to further edit my answer. I have learned a lot here, but some of this stuff is still over my head, and I want to get some code written while I learn more about this fantastic tool.

from itertools import takewhile
beginPosit = tempans.index('007 A000000 Y')
new=takewhile(lambda x: '007 ' in x, tempans[beginPosit:])

This is based on an earlier answer to a similar question and on Steven Huwig's answer.
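
To go from that block of lines to the dictionary mentioned in the question, something like the sketch below might work. It rests on an assumption read off the sample data only: that a code such as C020100 means field 02 of series 01. The function name series_table and the sample list are hypothetical.

```python
from itertools import takewhile

# Hypothetical sample shaped like the data in the question.
tempans = [
    '006 Z000000 N',
    '007 A000000 Y',
    '007 B000000  5',
    '007 C010100  1',
    '007 C020100 ACORN FUND',
    '007 C030100 N',
    '007 C010200  2',
    '007 C020200 ACORN INTERNATIONAL',
    '007 C030200 N',
    '008 A000000 N',
]


def series_table(lines, marker='007 A000000 Y', prefix='007 '):
    """Return {series: {field: value}} for the C lines of the marked block."""
    try:
        begin = lines.index(marker)
    except ValueError:
        return {}  # no marker, nothing to collect
    table = {}
    for line in takewhile(lambda s: s.startswith(prefix), lines[begin:]):
        parts = line.split(None, 2)  # keep multi-word values in one piece
        if len(parts) < 3 or not parts[1].startswith('C'):
            continue
        code, value = parts[1], parts[2].strip()
        field, series = code[1:3], code[3:5]  # assumed layout: C + field + series + 00
        table.setdefault(series, {})[field] = value
    return table


table = series_table(tempans)
print(table['01']['02'])
```

The predicate uses startswith() rather than a substring test, so a value that happens to contain '007 ' somewhere in the middle of a later line cannot keep the block open by accident.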
