Why did my code stop?

Posted on 2024-09-16 23:28:58

Hey, I've run into an issue where my program stops iterating through the file at record 57802, for a reason I cannot figure out. I put in a heartbeat counter so I could see which line it is on. That helped, but now I am stuck as to why it stops there. I thought it was a memory issue, but I ran it on my computer with 6GB of memory and it still stopped.

Is there a better way to do anything I am doing below? My goal is to read the file (it is a 15MB text log; I can send it to you if you need it), find matches based on the regex, and print each matching line. There is more to come, but that's as far as I have gotten. I am using Python 2.6.

Any ideas would help, and code comments too! I am a Python noob and am still learning.

import sys, os, os.path, operator
import re, time, fileinput

infile = os.path.join("C:\\","Python26","Scripts","stdout.log")

start = time.clock()

filename  = open(infile,"r")

match = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} +\w+ +\[([\w.]+)\] ((\w+).?)+:\d+ - (\w+)_SEARCH:(.+)')

count = 0
heartbeat = 0
for line in filename:
    heartbeat = heartbeat + 1
    print heartbeat
    lookup = match.search(line)
    if lookup:
        count = count + 1
        print line
end = time.clock()
elapsed = end-start
print "Finished processing at:",elapsed,"secs. Count of records =",count,"."

filename.close()

This is line 57802 where it fails:

2010-08-06 08:15:15,390 DEBUG [ah_admin] com.thg.struts2.SecurityInterceptor.intercept:46 - Action not SecurityAware; skipping privilege check.

This is a matching line:

2010-08-06 09:27:29,545 INFO  [patrick.phelan] com.thg.sam.actions.marketmaterial.MarketMaterialAction.result:223 - MARKET_MATERIAL_SEARCH:{"_appInfo":{"_appId":21,"_companyDivisionId":42,"_environment":"PRODUCTION"},"_description":"symlin","_createdBy":"","_fieldType":"GEO","_geoIds":["Illinois"],"_brandIds":[2883],"_archived":"ACTIVE","_expired":"UNEXPIRED","_customized":"CUSTOMIZED","_webVisible":"VISIBLE_ONLY"}

Sample data, just the first 5 lines:

2010-08-06 00:00:00,035 DEBUG [] com.thg.sam.jobs.PlanFormularyLoadJob.executeInternal:67 - Entered into PlanFormularyLoadJob: executeInternal
2010-08-06 00:00:00,039 DEBUG [] com.thg.ftpComponent.service.JScapeFtpService.open:153 - Opening FTP connection to sdrive/[email protected]:21
2010-08-06 00:00:00,040 DEBUG [] com.thg.sam.email.EmailUtils.sendEmail:206 - org.apache.commons.mail.MultiPartEmail@446e79
2010-08-06 00:00:00,045 DEBUG [] com.thg.sam.services.OrderService.getOrdersWithStatus:121 - Orders list size=13
2010-08-06 00:00:00,045 DEBUG [] com.thg.ftpComponent.service.JScapeFtpService.open:153 - Opening FTP connection to sdrive/[email protected]:21

Comments (6)

心作怪 2024-09-23 23:28:58

What does the input line that gives you trouble look like? I'd try printing it out. I suspect your CPU is pegged while this is running.

Nested regexps like the one you have can perform VERY badly when they don't match quickly.

((\w+).?)+:

Imagine a string that doesn't have the : in it but is fairly long. You'll end up in a world of backtracking as the regexp tries EVERY way of dividing the word characters between \w+ and . and THEN tries to group them in every way possible. If you can be more specific in your pattern, it'll pay off big time.
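
To see the blow-up on a small scale, here is a minimal sketch (not from the original post) that times the nested fragment against strings of word characters with no colon in them. The helper name time_failed_search and the input sizes are illustrative assumptions, kept small so the script finishes quickly; absolute timings will differ from machine to machine.

```python
import re
import time

# The nested fragment from the question's pattern.
nested = re.compile(r'((\w+).?)+:')

def time_failed_search(n):
    """Search n word characters with no ':' and return (result, seconds)."""
    text = 'a' * n
    started = time.time()
    result = nested.search(text)
    return result, time.time() - started

# Sizes are arbitrary; the work grows roughly exponentially with length.
for n in (8, 12, 16):
    result, seconds = time_failed_search(n)
    print("%2d chars: match=%r, %.4f s" % (n, result, seconds))
```

Every search fails, as it should, but each few extra characters make the failure noticeably slower. A 40-odd character class name followed by text that does not match is enough to make the loop look frozen.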

十年不长 2024-09-23 23:28:58

Your problem is definitely the part @paulrubel pointed out:

((\w+).?)+:\d+

Now that you've added sample data, it's obvious that the . is supposed to match a literal dot, which means you should have escaped it (\.). Also, you don't need the inner set of parentheses, and the outer set should be non-capturing, but it's the basic structure that's killing you; there are too many arrangements of word characters and dots it has to try before giving up. The other lines all fail before that part of the regex is attempted, which is why you don't have any problem with them.

When I try it in RegexBuddy, your regex matches the good line in 186 steps, and gives up trying on line 57802 after 1,000,000 steps. When I escape the dot, the good line only takes 90 steps to match, but it still times out on line 57802. But now I know that part of the regex can only match word characters and dots. Once it has consumed all of those it can, the next bit has to match :\d+; if it doesn't, I know there's no point trying other arrangements. I can use an atomic group to tell it not to bother:

(?>(?:\w+\.?)+):\d+

With that change, the good line matches in 83 steps, and line 57802 only takes 66 steps to report failure. But it's not always feasible to use atomic groups, so you should try to make your regex conform to the actual structure of the text it's matching. In this case you're matching what looks like a Java class name (some word characters, followed by zero or more instances of (a dot and some more word characters)) followed by a colon and a line number:

\w+(?:\.\w+)*:\d+

When I plug that into the regex, it matches the good line in 80 steps and rejects line 57802 in 67 steps; the atomic group isn't even needed.
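
As a rough check in Python, the restructured class-name fragment can be dropped into the rest of the question's pattern and run against the two lines quoted in the question. This is a sketch under assumptions: the capture-group layout is my own choice, and the JSON payload of the matching line is shortened for readability.

```python
import re

# Assumption: only the class-name part changes; the rest of the
# question's pattern is kept as posted.
pattern = re.compile(
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} +\w+ +'
    r'\[([\w.]+)\] '
    r'(\w+(?:\.\w+)*):\d+'
    r' - (\w+)_SEARCH:(.+)'
)

# The matching line from the question (payload shortened).
good_line = ('2010-08-06 09:27:29,545 INFO  [patrick.phelan] '
             'com.thg.sam.actions.marketmaterial.MarketMaterialAction.result:223'
             ' - MARKET_MATERIAL_SEARCH:{"_description":"symlin"}')
# Line 57802 from the question.
bad_line = ('2010-08-06 08:15:15,390 DEBUG [ah_admin] '
            'com.thg.struts2.SecurityInterceptor.intercept:46 - '
            'Action not SecurityAware; skipping privilege check.')

m = pattern.search(good_line)
print("%s %s" % (m.group(2), m.group(4)))  # patrick.phelan MARKET_MATERIAL
print(pattern.search(bad_line))            # None, returned immediately
```

As far as I know, Python's re module only accepts the (?>...) atomic-group syntax from version 3.11 onward, so on Python 2.6 the restructured pattern is the practical option.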

不美如何 2024-09-23 23:28:58

You compiled your regex but never use it?

lookup = re.search(match,line)

should be

lookup = match.search(line)

and you should use os.path.join()

infile = os.path.join("C:\\","Python26","Scripts","stdout.log")

Update:

Your regular expression can be simpler. Just check for the date/time stamp. Or else, don't use a regular expression at all. Say your date and time start at the beginning of the line:

for line in open("stdout.log"):
    s = line.split()
    if len(s) < 2:
        continue  # skip blank or malformed lines
    D, T = s[0], s[1]
    # use the time module and strptime to check valid date/time
    # or you can split "-" on D and T and do manual check using > or < and math
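
The strptime check mentioned in the comments could look roughly like the following. It is a sketch, not part of the original answer: the helper name parse_timestamp is hypothetical, and the format string assumes every timestamp looks like the ones in the sample data.

```python
import time

def parse_timestamp(line):
    """Return a time.struct_time for the leading timestamp, or None."""
    fields = line.split()
    if len(fields) < 2:
        return None
    # Drop the ",035" millisecond suffix before parsing.
    stamp = fields[0] + ' ' + fields[1].split(',')[0]
    try:
        return time.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    except ValueError:
        # Not a date/time in the assumed format.
        return None

# First line of the sample data from the question.
sample = ('2010-08-06 00:00:00,035 DEBUG [] '
          'com.thg.sam.jobs.PlanFormularyLoadJob.executeInternal:67 - '
          'Entered into PlanFormularyLoadJob: executeInternal')
parsed = parse_timestamp(sample)
print("%d-%02d-%02d" % (parsed.tm_year, parsed.tm_mon, parsed.tm_mday))
```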

一页 2024-09-23 23:28:58

Your pattern contains the fixed string _SEARCH: and a bunch of complicated expressions (including captures) that are really going to hammer the regex engine. But you don't do anything with the captured text, so all you want to know is "does it match?"

It may be simpler and quicker to just search for the fixed string on each line.

if '_SEARCH:' in line:
    print line
    count += 1
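
Wrapped up as a self-contained sketch, the substring test might be used like this. The function name find_search_lines and the shortened sample lines are assumptions for illustration. Note that the test is looser than the original regex: it accepts any line containing _SEARCH:, whatever the rest of the line looks like.

```python
def find_search_lines(lines):
    """Return the lines that contain the literal marker '_SEARCH:'."""
    return [line for line in lines if '_SEARCH:' in line]

# Shortened versions of the lines quoted in the question.
sample_lines = [
    '2010-08-06 08:15:15,390 DEBUG [ah_admin] '
    'com.thg.struts2.SecurityInterceptor.intercept:46 - Action not SecurityAware',
    '2010-08-06 09:27:29,545 INFO  [patrick.phelan] '
    'com.thg.sam.actions.marketmaterial.MarketMaterialAction.result:223 - '
    'MARKET_MATERIAL_SEARCH:{"_description":"symlin"}',
]

hits = find_search_lines(sample_lines)
print("Count of records = %d" % len(hits))
```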

舂唻埖巳落 2024-09-23 23:28:58

It might be a memory issue anyway. With huge files it's probably better to use the fileinput module instead, like this:

import fileinput
for line in fileinput.input([infile]):
    lookup = match.search(line)
    # etc.

初见你 2024-09-23 23:28:58

Try using pdb. If you put pdb.set_trace() in your heartbeat shortly before it stops, you can look at the specific line it's stopping on and see what each of your lines of code does with that line.

Edit: An example of pdb use:

import pdb
for i in range(50):
    print i
    if i == 12:
        pdb.set_trace()

Run that script, and you'll get something like the following:

0
1
2
3
4
5
6
7
8
9
10
11
12
> <stdin>(1)<module>()
(Pdb)

Now you can evaluate Python expressions from the context of i=12.

(Pdb) print i
12

Use that in your own script: put the pdb.set_trace() call in your loop right after you increment heartbeat, guarded by if heartbeat == 57802. Then you can print the line with p line, the result of your regex search with p match.search(line), etc.
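
Applied to a loop shaped like the one in the question, the conditional breakpoint might look like this. It is only a sketch: the function name scan and its break_at parameter are hypothetical, and the demo call leaves break_at unset so the script runs straight through.

```python
import pdb
import re

def scan(lines, pattern, break_at=None):
    """Count matching lines, pausing in pdb when the heartbeat hits break_at."""
    count = 0
    heartbeat = 0
    for line in lines:
        heartbeat += 1
        if heartbeat == break_at:
            pdb.set_trace()  # at the prompt, try: p heartbeat, p line
        if pattern.search(line):
            count += 1
    return count

# break_at stays None here so nothing pauses; pass break_at=57802
# to stop on the line from the question.
demo = scan(['a_SEARCH:x', 'nothing here'], re.compile(r'_SEARCH:'))
print(demo)
```

Because the breakpoint fires before the search runs, you get the prompt while the problem line is still untouched and can inspect it before anything has a chance to hang.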
