How can I speed up a readline loop in Python?
I'm importing several parts of a database dump in text format into MySQL; the problem is that before the interesting data there is a lot of uninteresting stuff in front of it. I wrote this loop to get to the needed data:
def readloop(DBFILE):
    txtdb = open(DBFILE, 'r')
    sline = ""

    # loop till 1st "customernum:" is found
    while sline.startswith("customernum: ") is False:
        sline = txtdb.readline()

    while sline.startswith("customernum: "):
        data = []
        data.append(sline)
        sline = txtdb.readline()
        while sline.startswith("customernum: ") is False:
            data.append(sline)
            sline = txtdb.readline()
            if len(sline) == 0:
                break

        customernum = getitem(data, "customernum: ")
        street = getitem(data, "street: ")
        country = getitem(data, "country: ")
        zip = getitem(data, "zip: ")
The text file is pretty huge, so just looping until the first wanted entry takes a very long time. Does anyone have an idea whether this could be done faster (or whether my whole approach is not the best idea)?

Many thanks in advance!
5 Answers
Please do not write this code:
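Presumably this refers to the question's spelled-out boolean test, roughly:

    # the pattern being criticized: comparing a boolean result against False with "is"
    while sline.startswith("customernum: ") is False:
        sline = txtdb.readline()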
Boolean conditions are boolean for cryin' out loud, so they can be tested (or negated and tested) directly:
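A minimal sketch of the direct form, reusing sline and txtdb from the question:

    # negate and test the boolean result directly; no comparison against False needed
    while not sline.startswith("customernum: "):
        sline = txtdb.readline()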
Your second while loop isn't written as "while condition is True:", so I'm curious why you felt the need to test "is False" in the first one.
Pulling out the dis module, I thought I'd dissect this a little further. In my pyparsing experience, function calls are total performance killers, so it would be nice to avoid function calls if possible. Here is your original test:
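Something like the following regenerates that disassembly (under Python 2, the era of this answer, False is an ordinary global name, which is why it shows up as LOAD_GLOBAL):

    import dis

    # Disassembling the original test shows a CALL_FUNCTION for startswith(),
    # a LOAD_GLOBAL for False (on Python 2), and a COMPARE_OP for the 'is' test.
    dis.dis(compile('sline.startswith("customernum: ") is False', '<test>', 'eval'))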
Two expensive things happen here: CALL_FUNCTION and LOAD_GLOBAL. You could cut back on LOAD_GLOBAL by defining a local name for False. But what if we just drop the 'is' test completely?:
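Roughly, the two variants look like this (a sketch reusing sline and txtdb from the question):

    # Variant 1: bind False to a local name, so the lookup becomes a cheap
    # LOAD_FAST instead of a LOAD_GLOBAL (on Python 2)
    false = False
    while sline.startswith("customernum: ") is false:
        sline = txtdb.readline()

    # Variant 2: drop the 'is' test entirely and just negate the boolean result
    while not sline.startswith("customernum: "):
        sline = txtdb.readline()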
We've collapsed a LOAD_xxx and COMPARE_OP into a simple UNARY_NOT. "is False" certainly isn't helping the performance cause any. Now, what if we can do some gross elimination of a line without doing any function calls at all? If the first character of the line is not a 'c', there is no way it will startswith('customernum'). Let's try that:
(Note that using [0] to get the first character of a string does not create a slice - this is in fact very fast.)
Now, assuming there are not a large number of lines starting with 'c', the rough-cut filter can eliminate a line using all fairly fast instructions. In fact, by testing "t[0] != 'c'" instead of "not t[0] == 'c'" we save ourselves an extraneous UNARY_NOT instruction. So, using this knowledge about short-cut optimization, I suggest changing this code:
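That is, presumably the initial scan loop from the question:

    # one readline() call plus one startswith() call for every single line
    while sline.startswith("customernum: ") is False:
        sline = txtdb.readline()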
To this:
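A sketch of the suggested replacement, reconstructed from the description that follows (cheap first-character pre-filter, direct iteration over the file object):

    for sline in txtdb:
        # only lines that begin with 'c' are worth the startswith() call
        if sline[0] == 'c' and sline.startswith("customernum: "):
            break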
Note that I have also removed the .readline() function call, and just iterate over the file using "for sline in txtdb".
I realize Alex has provided a different body of code entirely for finding that first 'customernum' line, but I would try optimizing within the general bounds of your algorithm, before pulling out big but obscure block reading guns.
The general idea for optimization is to proceed "by big blocks" (mostly ignoring line structure) to locate the first line of interest, then move on to line-by-line processing for the rest. It's somewhat finicky and error-prone (off-by-one errors and the like), so it really needs testing, but the general idea is as follows:
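A rough sketch of that approach, assuming the records are marked by lines starting with "customernum: " and that seek offsets correspond directly to what read() returns (i.e. Unix-style line endings); skip_to_first_customernum and the block size are illustrative choices, not part of the original answer:

    def skip_to_first_customernum(txtdb, blocksize=256 * 1024):
        """Scan the file in big blocks until a line starting with
        "customernum: " is found, then seek back to the start of that line."""
        marker = "\ncustomernum: "
        consumed = 0     # bytes handed out by read() so far
        tail = "\n"      # seeded with '\n' so a marker on the very first line is found;
                         # also carries over block ends so a marker split across blocks still matches
        while True:
            block = txtdb.read(blocksize)
            if not block:
                return False                          # marker never found
            haystack = tail + block
            pos = haystack.find(marker)
            if pos >= 0:
                # haystack[0] sits at absolute offset consumed - len(tail);
                # +1 skips the '\n' so the file is positioned at the 'c'
                txtdb.seek(consumed - len(tail) + pos + 1)
                return True
            consumed += len(block)
            tail = haystack[-len(marker):]

    # usage sketch: position the file, then let the original per-line loop take over
    # txtdb = open(DBFILE, 'r')
    # if skip_to_first_customernum(txtdb):
    #     sline = txtdb.readline()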
Here, I've tried to keep as much of your structure intact as feasible, doing only minor enhancements beyond the "big idea" of this refactoring.
I guess you are still writing this import script, it gets boring to wait while testing it, and the data stays the same the whole time.
You can run the script once to detect the actual positions in the file you want to jump to, with print txtdb.tell(). Write those down and replace the searching code with txtdb.seek(pos). Basically, that's building an index for the file ;-)

Another, more conventional way would be to read data in larger chunks, a few MB at a time, not just the few bytes on a line.
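A minimal sketch of that one-off indexing pass (DBFILE and RECORDED_POS are placeholder names for illustration):

    # one-off run: find and note the offset at which the first interesting line starts
    txtdb = open(DBFILE, 'r')
    while True:
        pos = txtdb.tell()              # offset of the line about to be read
        sline = txtdb.readline()
        if not sline:                   # end of file without finding the marker
            break
        if sline.startswith("customernum: "):
            print(pos)                  # write this number down
            break

    # later runs: jump straight to the recorded offset instead of scanning
    txtdb = open(DBFILE, 'r')
    txtdb.seek(RECORDED_POS)            # the number printed by the one-off run
    sline = txtdb.readline()            # this is now the first "customernum: " line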
This might help: Python Performance Part 2: Parsing Large Strings for 'A Href' Hypertext
Tell us more about the file.
Can you use file.seek to do a binary search? Seek to the halfway mark, read a few lines, determine whether you are before or after the part you need, and recurse. That will turn your O(n) search into O(log n).
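A rough sketch of that bisection, assuming the "customernum: " section forms one contiguous block so that "before or after" is well defined (find_section_start is an illustrative name, and the boundary handling is exactly the part that would need testing):

    def find_section_start(path, marker=b"customernum: "):
        """Binary-search a large file for the region where lines start with marker.
        Assumes every marker line comes after every non-marker line."""
        f = open(path, 'rb')
        f.seek(0, 2)                    # jump to the end to learn the file size
        lo, hi = 0, f.tell()
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                # discard a probably-partial line to resync
            line = f.readline()         # first complete line after offset mid
            if line and not line.startswith(marker):
                lo = mid + 1            # still in the uninteresting prefix: go right
            else:
                hi = mid                # at or past the section: go left
        f.close()
        return lo                       # offset close to the start of the section

    # usage sketch:
    # txtdb = open(DBFILE, 'r')
    # txtdb.seek(find_section_start(DBFILE))
    # txtdb.readline()                  # resync to the next full line, then parse as before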