How can I speed up a readline loop in Python?

Posted 2024-08-05 00:39:23

I'm importing several parts of a database dump in text format into MySQL; the problem is that before the interesting data there is a lot of uninteresting stuff in front of it. I wrote this loop to get to the needed data:

def readloop(DBFILE):
    txtdb = open(DBFILE, 'r')

    sline = ""

    # loop till 1st "customernum:" is found
    while sline.startswith("customernum:  ") is False:
        sline = txtdb.readline()

    while sline.startswith("customernum:  "):
        data = []
        data.append(sline)
        sline = txtdb.readline()
        while sline.startswith("customernum:  ") is False:
            data.append(sline)
            sline = txtdb.readline()
            if len(sline) == 0:
                break
        customernum = getitem(data, "customernum:  ")
        street = getitem(data, "street:  ")
        country = getitem(data, "country:  ")
        zip = getitem(data, "zip:  ")

The text file is pretty huge, so just looping until the first wanted entry takes a lot of time. Does anyone have an idea whether this could be done faster (or whether the whole way I approached this is not the best idea)?

Many thanks in advance!

难理解 2024-08-12 00:39:23

Please do not write this code:

while condition is False:

Boolean conditions are boolean for cryin' out loud, so they can be tested (or negated and tested) directly:

while not condition:

Your second while loop isn't written as "while condition is True:", so I'm curious why you felt the need to test "is False" in the first one.

Pulling out the dis module, I thought I'd dissect this a little further. In my pyparsing experience, function calls are total performance killers, so it would be nice to avoid function calls if possible. Here is your original test:

>>> test = lambda t : t.startswith('customernum') is False
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 LOAD_GLOBAL              1 (False)
             15 COMPARE_OP               8 (is)
             18 RETURN_VALUE

Two expensive things happen here, CALL_FUNCTION and LOAD_GLOBAL. You could cut back on LOAD_GLOBAL by defining a local name for False:

>>> test = lambda t,False=False : t.startswith('customernum') is False
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 LOAD_FAST                1 (False)
             15 COMPARE_OP               8 (is)
             18 RETURN_VALUE

But what if we just drop the 'is' test completely?:

>>> test = lambda t : not t.startswith('customernum')
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 UNARY_NOT
             13 RETURN_VALUE

We've collapsed a LOAD_xxx and a COMPARE_OP into a simple UNARY_NOT. "is False" certainly isn't helping the performance cause any.

Now what if we can do some gross elimination of a line without doing any function calls at all? If the first character of the line is not a 'c', there is no way it will startswith('customernum'). Let's try that:

>>> test = lambda t : t[0] != 'c' and not t.startswith('customernum')
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_CONST               0 (0)
              6 BINARY_SUBSCR
              7 LOAD_CONST               1 ('c')
             10 COMPARE_OP               3 (!=)
             13 JUMP_IF_FALSE           14 (to 30)
             16 POP_TOP
             17 LOAD_FAST                0 (t)
             20 LOAD_ATTR                0 (startswith)
             23 LOAD_CONST               2 ('customernum')
             26 CALL_FUNCTION            1
             29 UNARY_NOT
        >>   30 RETURN_VALUE

(Note that using [0] to get the first character of a string does not create a slice - this is in fact very fast.)

Now, assuming there are not a large number of lines starting with 'c', the rough-cut filter can eliminate a line using all fairly fast instructions. In fact, by testing "t[0] != 'c'" instead of "not t[0] == 'c'" we save ourselves an extraneous UNARY_NOT instruction.
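
(A side note: the bytecode listings above come from an older CPython; in Python 3, False is a keyword, so the False=False default-argument trick is a syntax error there.) As a rough, hypothetical check, the three predicates can be compared with timeit on your own interpreter; absolute numbers vary by machine and Python version, but the ordering should match the bytecode analysis:

import timeit

# Micro-benchmark sketch: time each predicate against a non-matching line.
setup = "t = 'some uninteresting line'"
tests = [
    "t.startswith('customernum') is False",
    "not t.startswith('customernum')",
    "t[0] != 'c' and not t.startswith('customernum')",
]
for expr in tests:
    elapsed = timeit.timeit(expr, setup=setup, number=1000000)
    print("%-50s %.3f s" % (expr, elapsed))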

So, using this learning about short-cut optimization, I suggest changing this code:

while sline.startswith("customernum:  ") is False:
    sline = txtdb.readline()

while sline.startswith("customernum:  "):
    ... do the rest of the customer data stuff...

To this:

for sline in txtdb:
    if sline[0] == 'c' and \
       sline.startswith("customernum:  "):
        ... do the rest of the customer data stuff...

Note that I have also removed the .readline() function call, and just iterate over the file using "for sline in txtdb".
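
Putting those pieces together, a minimal sketch of the whole loop in that style might look like this (the process_record helper and the getitem body are assumptions for illustration, since the original getitem isn't shown):

def getitem(data, prefix):
    # assumed behaviour of the asker's helper: return the text after the
    # first line starting with `prefix`, or '' if no such line exists
    for line in data:
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return ''

def process_record(data):
    # hypothetical per-record handler standing in for the MySQL insert
    customernum = getitem(data, "customernum:  ")
    street = getitem(data, "street:  ")
    country = getitem(data, "country:  ")
    zipcode = getitem(data, "zip:  ")   # renamed: 'zip' shadows a builtin
    print(customernum, street, country, zipcode)

def readloop(DBFILE):
    data = []
    with open(DBFILE, 'r') as txtdb:
        for sline in txtdb:
            if sline[0] == 'c' and sline.startswith("customernum:  "):
                if data:
                    process_record(data)   # flush the previous record
                data = [sline]             # start a new record
            elif data:
                data.append(sline)         # inside a record: collect the line
            # (before the first record, data is empty, so junk is skipped)
    if data:
        process_record(data)               # don't forget the last record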

I realize Alex has provided a different body of code entirely for finding that first 'customernum' line, but I would try optimizing within the general bounds of your algorithm before pulling out big but obscure block-reading guns.

陪你到最终 2024-08-12 00:39:23

The general idea for optimization is to proceed "by big blocks" (mostly ignoring line structure) to locate the first line of interest, then move on to by-line processing for the rest. It's somewhat finicky and error-prone (off-by-one and the like), so it really needs testing, but the general idea is as follows...:

import itertools

def readloop(DBFILE):
  txtdb=open(DBFILE, 'r')
  tag = "customernum:  "
  BIGBLOCK = 1024 * 1024
  # locate first occurrence of tag at line-start
  # (assumes the VERY FIRST line doesn't start that way,
  # else you need a special-case and slight refactoring)
  blob = ''
  while True:
    chunk = txtdb.read(BIGBLOCK)
    if not chunk:
      # EOF: tag not present at all -- warn about that, then return
      # (testing the fresh chunk rather than blob avoids looping forever
      # on the leftover tail once the file is exhausted)
      return
    blob = blob + chunk
    where = blob.find('\n' + tag)
    if where != -1:  # found it!
      blob = blob[where+1:] + txtdb.readline()
      break
    blob = blob[-len(tag):]
  # now make a by-line iterator over the part of interest
  thelines = itertools.chain(blob.splitlines(1), txtdb)
  sline = next(thelines, '')
  while sline.startswith(tag):
    data = []
    data.append(sline)
    sline = next(thelines, '')
    while not sline.startswith(tag):
      data.append(sline)
      sline = next(thelines, '')
      if not sline:
        break
    customernum = getitem(data, "customernum:  ")
    street = getitem(data, "street:  ")
    country = getitem(data, "country:  ")
    zip = getitem(data, "zip:  ")

Here, I've tried to keep as much of your structure intact as feasible, doing only minor enhancements beyond the "big idea" of this refactoring.
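
One subtle detail worth a note: the blob = blob[-len(tag):] line carries a len(tag)-character tail from each block into the next, so an occurrence of '\n' + tag that straddles a block boundary is still found. A tiny contrived check of the mechanism:

tag = "customernum:  "
text = "some junk here\ncustomernum:  42\n"
block1, block2 = text[:20], text[20:]      # the boundary splits the tag
assert ('\n' + tag) not in block1          # missed in the first block alone
carry = block1[-len(tag):]                 # keep a len(tag) tail...
assert ('\n' + tag) in (carry + block2)    # ...and the match is recovered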

你没皮卡萌 2024-08-12 00:39:23

I guess you are writing this import script, and it gets boring to wait while testing it, so the data stays the same the whole time.

You can run the script once to detect the actual positions in the file you want to jump to, with print(txtdb.tell()). Write those down and replace the searching code with txtdb.seek(pos). Basically, that's building an index for the file ;-)
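
A minimal sketch of that tell/seek idea ("dump.txt" is a hypothetical file name, and it assumes the dump really doesn't change between runs):

def find_start(path, tag="customernum:  "):
    # one-off scan: return the file offset of the first interesting line
    with open(path, 'r') as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                return -1                  # tag never found
            if line.startswith(tag):
                return pos

pos = find_start("dump.txt")               # run once, write the number down
with open("dump.txt", 'r') as f:
    f.seek(pos)                            # later runs jump straight there
    for sline in f:
        pass                               # ...process the records from here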

Another, more conventional, way would be to read the data in larger chunks, a few MB at a time, not just the few bytes of a single line.

花间憩 2024-08-12 00:39:23

Tell us more about the file.

Can you use file.seek to do a binary search? Seek to the halfway mark, read a few lines, determine whether you are before or after the part you need, and recurse. That will turn your O(n) search into O(log n).
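
A rough sketch of that binary-search idea, under two assumptions about the dump's format: all of the junk precedes the records, and record lines are recognizable by a handful of known prefixes.

# Known record-line prefixes -- an assumption about the dump's format.
RECORD_PREFIXES = (b"customernum:  ", b"street:  ", b"country:  ", b"zip:  ")

def find_records_start(path):
    # binary-search byte offsets for the junk/records boundary
    with open(path, 'rb') as f:
        f.seek(0, 2)                       # jump to the end to learn the size
        lo, hi = 0, f.tell()
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                   # discard the (possibly partial) line
            if f.readline().startswith(RECORD_PREFIXES):
                hi = mid                   # boundary is at or before mid
            else:
                lo = mid + 1               # boundary is after mid
        return lo

Seeking to the returned offset, discarding the partial line, and reading forward then lands on the first record.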
