Program control flow in Python
I have some data that I have stored in a list and if I print out the list I see the following:
.
.
.
007 A000000 Y
007 B000000 5
007 C010100 1
007 C020100 ACORN FUND
007 C030100 N
007 C010200 2
007 C020200 ACORN INTERNATIONAL
007 C030200 N
007 C010300 3
007 C020300 ACORN USA
007 C030300 N
007 C010400 4
.
.
.
The dots before and after the sequence indicate that there is other, similarly structured data that may or may not be part of this seventh item (007). If the first value in the seventh item is '007 A000000 Y', then I want to create a dictionary of some of the data items. I can do this, and have done so, by running through all of the items in my list and comparing their values to some test values for the variables. For instance, a line of code like:
    if dataLine.find('007 B') == 0:
        numberOfSeries = int(dataLine.split()[2])
What I want to do, though, is:
    if dataLine.find('007 A000000 Y') == 0:
        READ THE NEXT LINE RIGHT HERE
Right now I have to iterate through the entire list for each cycle.
I want to shorten the processing time because I have about 60K files, each with between 500 and 5,000 lines.
I have thought about creating another reference to the list and counting the data lines until dataLine.find('007 A000000 Y') == 0, but that does not seem like the most elegant solution.
Comments (5)
You can use itertools.groupby() to segment your sequence into multiple sub-sequences.
itertools.dropwhile() would also work if you really just want to seek up to that line.
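As a rough illustration of both suggestions, here is a minimal sketch. The sample `lines` list, the `item_number` helper, and the `marker` variable are made up for this example, and it assumes the item number is always the first whitespace-separated field and that all lines of one item are contiguous.

```python
from itertools import dropwhile, groupby

# Sample lines shaped like the data in the question.
lines = [
    "006 A000000 N",
    "007 A000000 Y",
    "007 B000000 5",
    "007 C010100 1",
    "007 C020100 ACORN FUND",
    "008 A000000 N",
]


def item_number(line):
    # Assumption: the item number is the first whitespace-separated field.
    return line.split()[0]


# groupby() yields one (key, group) pair per run of consecutive lines
# that share the same item number.
groups = {key: list(group) for key, group in groupby(lines, key=item_number)}

# dropwhile() skips lines until the marker line is reached, then yields
# the marker and everything after it.
marker = "007 A000000 Y"
rest = dropwhile(lambda line: not line.startswith(marker), lines)
next(rest)              # consume the marker line itself
next_line = next(rest)  # the line immediately after the marker
```

Both tools walk the list once, so neither requires rescanning from the start for each item.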
You could read the data into a dictionary. Assuming you are reading from a file-like object infile:
Now if you want to read the line after '007 A000000 Y', you can do so as:
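A minimal sketch of both steps follows; it is one possible reading of the idea, not necessarily the code this answer originally showed. It assumes each line has the form "item field value", uses an `io.StringIO` object as a stand-in for `infile`, and the names `data`, `idx`, and `next_entry` are illustrative.

```python
import io
from collections import defaultdict

# Stand-in for the file-like object `infile` mentioned above.
infile = io.StringIO(
    "007 A000000 Y\n"
    "007 B000000 5\n"
    "007 C010100 1\n"
    "007 C020100 ACORN FUND\n"
)

# Map each item number to its (field, value) pairs, kept in file order.
data = defaultdict(list)
for line in infile:
    parts = line.split(None, 2)  # at most 3 pieces, so "ACORN FUND" stays whole
    if len(parts) == 3:
        item, field, value = parts
        data[item].append((field, value.strip()))

# Find the marker entry, then read the entry that follows it.
idx = data["007"].index(("A000000", "Y"))
next_entry = data["007"][idx + 1]
```

Note that `list.index()` raises `ValueError` when the marker is absent, so real code would need to handle files where item 007 does not start with `Y`.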
The only difficulty with using all the data in a dictionary is that a really big dictionary can become troublesome. (It's what we used to call the "Big Ole Matrix" approach.)
A solution to this is to construct an index in the dictionary, creating a mapping of key->offset, using the tell method to get the file offset value. Then you can refer to the line again by seeking with the seek method.
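The key->offset index could be sketched as follows. This is only an illustration under stated assumptions: the key is taken to be the (item, field) pair, the file is opened in binary mode so offsets are plain byte positions, and `sample`, `index`, and `found` are hypothetical names. A temporary file is written first so the example runs on its own.

```python
import os
import tempfile

# Write a small sample file so the sketch is runnable on its own.
sample = (
    "007 A000000 Y\n"
    "007 B000000 5\n"
    "007 C010100 1\n"
    "007 C020100 ACORN FUND\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

index = {}  # hypothetical mapping: (item, field) -> byte offset of the line
with open(path, "rb") as f:
    # readline() rather than a for loop: text-mode files refuse tell()
    # while being iterated, and this form works in either mode.
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        fields = line.split()
        if len(fields) >= 2:
            index[(fields[0], fields[1])] = offset

    # Jump straight back to a line of interest without rescanning.
    f.seek(index[(b"007", b"B000000")])
    found = f.readline().decode().strip()

os.remove(path)
```

Only the offsets are held in memory, so the index stays small even when the file itself is large.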
You said you wanted to do this:
    if dataLine.find('007 A000000 Y') == 0:
        READ THE NEXT LINE RIGHT HERE
Presumably this is within a "for dataLine in data" loop.
Alternatively, you could use an iterator directly instead of in a for loop:
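One way this might look is sketched below. The `data` list is sample input, `lines` and `following` are illustrative names, and `dataLine` and `numberOfSeries` follow the question. The sketch assumes the line of interest directly follows the marker.

```python
data = [
    "006 A000000 N",
    "007 A000000 Y",
    "007 B000000 5",
    "007 C010100 1",
]

numberOfSeries = None
lines = iter(data)  # explicit iterator, so next() can advance it by hand
for dataLine in lines:
    if dataLine.startswith("007 A000000 Y"):
        # Pull the following line from the same iterator. The default of
        # None covers the case where the marker is the last line.
        following = next(lines, None)
        if following is not None and following.startswith("007 B"):
            numberOfSeries = int(following.split()[2])
        break
```

Because the for loop and the `next()` call share one iterator, the list is traversed at most once.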
You also mention having 60K files to process. Are they all formatted similarly? Do they need to be processed differently? If they can all be processed the same way, you could consider chaining them together in a single flow:
This lets you lazily process all your files in a single iterator. Not sure if that helps your current situation but I thought it worth mentioning.
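A minimal sketch of chaining files into one lazy stream, using `itertools.chain.from_iterable`. The helper names `read_lines` and `all_lines` are hypothetical, and two small temporary files stand in for the real input files.

```python
import os
import tempfile
from itertools import chain


def read_lines(path):
    # Generator: the file is opened only when iteration reaches it.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def all_lines(paths):
    # One lazy stream over every file, in the order given.
    return chain.from_iterable(read_lines(p) for p in paths)


# Demo with two small temporary files standing in for the real ones.
with tempfile.TemporaryDirectory() as tmpdir:
    paths = []
    for name, text in [
        ("a.txt", "007 A000000 Y\n007 B000000 5\n"),
        ("b.txt", "007 A000000 N\n"),
    ]:
        path = os.path.join(tmpdir, name)
        with open(path, "w") as f:
            f.write(text)
        paths.append(path)
    combined = list(all_lines(paths))
```

One caveat: a single chained stream no longer shows where one file ends and the next begins, so this fits best when that boundary does not matter or can be recovered from the data.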
Okay, while I was Googling to make sure I had covered my bases, I came across a solution:
I find that I forget to think in lists and dictionaries even though I use them. Python has some powerful tools for working with these types that make manipulating them faster.
I need a slice, and the slice references are easily obtained by
where tempans is the data list.
Now I can write
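A sketch of how the slice boundaries might be obtained. `tempans` is the data list named above and `dataLine` and `numberOfSeries` come from the question; `beg` and `end` are illustrative names, and the rule used for where item 007 ends is an assumption.

```python
tempans = [
    "006 A000000 N",
    "007 A000000 Y",
    "007 B000000 5",
    "007 C010100 1",
    "008 A000000 N",
]

# list.index() returns the position of the first exact match
# (and raises ValueError if the marker is absent).
beg = tempans.index("007 A000000 Y")

# Assumption: item 007 ends at the first later line with a different
# item number; fall back to the end of the list if none follows.
end = next(
    (i for i in range(beg, len(tempans)) if not tempans[i].startswith("007 ")),
    len(tempans),
)

numberOfSeries = None
for dataLine in tempans[beg:end]:  # only the lines belonging to item 007
    if dataLine.startswith("007 B"):
        numberOfSeries = int(dataLine.split()[2])
```

The loop then touches only the lines of the seventh item instead of the whole list.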
I think I answered my own question. I learned a lot from the other answers and appreciate them, but I think this is what I needed.
Okay, I am going to further edit my answer. I have learned a lot here, but some of this is still over my head, and I want to get some code written while I learn more about this fantastic tool.
This is based on an earlier answer to a similar question and Steven Huwig's answer