Python 多行正则表达式 +一次性读取一个文件的多个条目

发布于 2024-11-01 22:03:30 字数 1257 浏览 2 评论 0原文

//Last modified: Sat, Apr 16, 2011 09:55:04 AM
//Codeset: ISO-8859-1
fileInfo "version" "20x64";
createNode newnode -n "a_SET";
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" "blabla";
    setAttr -l on -k on ".test2" -type "string" "blablabla";
createNode newnode -n "b_SET";
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" "hmm";
    setAttr -l on -k on ".test2" -type "string" "ehmehm";

在Python中：

我需要读取新节点名称，例如“a_SET”和“b_SET”及其相应的属性值，因此{“a_SET”：{“test1”：“blabla”，“test2”：“blablabla”}和相同的对于 b_SET - 可能有未知数量的集合 - 例如 c_SET d_SET 等。

我尝试循环遍历行并在那里匹配它：

for line in fileopened:
    setmatch = re.match( r'^(createNode set -n ")(.*)(_SET)(.*)' , line)
     if setmatch:
            sets.append(setmatch.group(2))

一旦我在这里找到匹配项，我就会循环遍历下一行以获得该集合的属性（test1，test2），直到我找到一个新集合 - 例如 c_SET 或 EOF。

使用 re.MULTILINE 一次性获取所有信息的最佳方式是什么？

原文

//Last modified: Sat, Apr 16, 2011 09:55:04 AM
//Codeset: ISO-8859-1
fileInfo "version" "20x64";
createNode newnode -n "a_SET";
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" "blabla";
    setAttr -l on -k on ".test2" -type "string" "blablabla";
createNode newnode -n "b_SET";
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" "hmm";
    setAttr -l on -k on ".test2" -type "string" "ehmehm";

in Python:

I need to read the newnode names for instance "a_SET" and "b_SET" and their corresponding attribute values so {"a_SET": {"test1":"blabla", "test2":"blablabla"} and the same for the b_SET - there could be unknown amount of sets - like c_SET d_SET etc.

I've tried looping through lines and matching it there:

for line in fileopened:
    setmatch = re.match( r'^(createNode set -n ")(.*)(_SET)(.*)' , line)
     if setmatch:
            sets.append(setmatch.group(2))

and as soon as I find a match here I would loop through next lines to get the attributes (test1, test2) for that set until I find a new set - for instance c_SET or an EOF.

What would be the best way to grab all that info in one go with the re.MULTILINE?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

孤独岁月 2024-11-08 22:03:30

您可以使用正则表达式正向预测来拆分组：

(yourGroupSeparator)(.*?)(?=yourGroupSeparator|\Z)

在您的示例中：

import re

lines = open("e:/temp/test.txt").read()
matches = re.findall(r'createNode newnode \-n (\"._SET\");(.*?)(?=createNode|\Z)', lines, re.MULTILINE + re.DOTALL);

for m in matches:
    print "%s:" % m[0], m[1]


"""
Result:
>>>
"a_SET":
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" "blabla";
    setAttr -l on -k on ".test2" -type "string" "blablabla";

"b_SET":
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" "hmm";
    setAttr -l on -k on ".test2" -type "string" "ehmehm";
"""

如果您想要字典上的结果，您可以

result = {}
for k, v in matches:
    result[k] = v   # or maybe v.split() or v.split(";")

在 findall 之后使用：

You can use regexp positive lookahead to split the groups:

(yourGroupSeparator)(.*?)(?=yourGroupSeparator|\Z)

In your example:

import re

lines = open("e:/temp/test.txt").read()
matches = re.findall(r'createNode newnode \-n (\"._SET\");(.*?)(?=createNode|\Z)', lines, re.MULTILINE + re.DOTALL);

for m in matches:
    print "%s:" % m[0], m[1]


"""
Result:
>>>
"a_SET":
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" "blabla";
    setAttr -l on -k on ".test2" -type "string" "blablabla";

"b_SET":
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" "hmm";
    setAttr -l on -k on ".test2" -type "string" "ehmehm";
"""

If you want the results on a dict, you can use:

result = {}
for k, v in matches:
    result[k] = v   # or maybe v.split() or v.split(";")

after findall

回复收藏 0 原文

梦与时光遇 2024-11-08 22:03:30

我得到了这个：

import re

filename = 'tr.txt'

with open(filename,'r') as f:
    ch = f.read()

pat = re.compile('createNode newnode -n ("\w+?_SET");(.*?)(?=createNode|\Z)',re.DOTALL)
pit = re.compile('^ *setAttr.+?("[^"\n]+").+("[^"\n]+");(?:\n|\Z)',re.MULTILINE)

dic = dict( (mat.group(1),dict(pit.findall(mat.group(2)))) for mat in pat.finditer(ch)) 
print dic

结果

{'"b_SET"': {'".test2"': '"ehmehm"', '".test1"': '"hmm"'}, '"a_SET"': {'".test2"': '"blablabla"', '".test1"': '"blabla"'}}

。

问题：

如果字符串中必须有字符 '"' 怎么办？它是如何表示的？

。

编辑

我很难找到解决方案，因为我没有选择该设施。

这是一个新模式捕获出现在字符串 " setAttr" 之后和之前的第一个字符串 "..." 和最后一个字符串 "..." next " setAttr". 因此可以存在多个 "..." ，而不仅仅是 3 个。您没有问这个条件，但我认为它可能恰好是。

我还设法使字符串中存在换行符以捕获 "....\n......" ，不仅在它们周围，我有义务为我发明一些新东西： (?:\n(?! *setAttr)|[^"\n]) 这意味着：除 '"' 之外的所有字符和常见的 newlines \n ，都被接受，并且仅接受后面没有以 ' *setAttr' 开头的行的换行符

For (?:\n( ?! *setAttr)|.) 它的意思是：换行符后面没有以 ' *setAttr' 和所有其他非换行符开头的行。

因此，任何其他特殊序列（如制表符或其他任何序列）都会在匹配中自动接受。

ch = '''//Last modified: Sat, Apr 16, 2011 09:55:04 AM
//Codeset: ISO-8859-1
fileInfo "version" "20x64";
createNode newnode -n "a_SET";
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" "blabla";
    setAttr -l on -k on ".test2" -type "string" "blablabla";
createNode newnode -n "b_SET";
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" (
      "hmm bl
      abla\tbla" );
    setAttr -l on -k on ".tes\nt\t2" -type "string" "ehm\tehm";
    setAttr -l on -k on ".test3" -type "string" "too
    much" "pff" """ "feretini" "gol\nolo";
    '''

import re

pat = re.compile('createNode newnode -n ("\w+?_SET");(.*?)(?=createNode|\Z)',re.DOTALL)
pot = re.compile('^ *setAttr.+?'
                 '"((?:\n(?! *setAttr)|[^"\n])+)"'
                 '(?:\n(?! *setAttr)|.)+'
                 '"((?:\n(?! *setAttr)|[^"\n])+)"'
                 '.*;(?:\n|\Z)',re.MULTILINE)

dic = dict( (mat.group(1),dict(pot.findall(mat.group(2)))) for mat in pat.finditer(ch)) 
for x in dic:
    print x,'\n',dic[x],'\n'

结果

"b_SET" 
{'.test3': 'gol\nolo', '.test1': 'hmm bl\n      abla\tbla', '.tes\nt\t2': 'ehm\tehm'} 

"a_SET" 
{'.test1': 'blabla', '.test2': 'blablabla'}

I got this:

import re

filename = 'tr.txt'

with open(filename,'r') as f:
    ch = f.read()

pat = re.compile('createNode newnode -n ("\w+?_SET");(.*?)(?=createNode|\Z)',re.DOTALL)
pit = re.compile('^ *setAttr.+?("[^"\n]+").+("[^"\n]+");(?:\n|\Z)',re.MULTILINE)

dic = dict( (mat.group(1),dict(pit.findall(mat.group(2)))) for mat in pat.finditer(ch)) 
print dic

result

{'"b_SET"': {'".test2"': '"ehmehm"', '".test1"': '"hmm"'}, '"a_SET"': {'".test2"': '"blablabla"', '".test1"': '"blabla"'}}

Question:

what if there must be character '"' in the strings ? How is it represented ?

EDIT

I had some difficulty to find the solution because I didn't choose the facility.

Here's a new pattern that catches the FIRST string "..." and the LAST string "..." present after a string " setAttr" and before the next " setAttr". So several "..." can be present , not only 3. You didn't asked this condition, but I thought it may happen to be needed.

I also managed to make possible the presence of newlines in the strings to catch "....\n......" , not only around them. For that , I was obliged to invent something new for me: (?:\n(?! *setAttr)|[^"\n]) that means : all characters, except '"' and common newlines \n , are accepted and also only the newlines that are not followed by a line beginning with ' *setAttr'

For (?:\n(?! *setAttr)|.) it means : newlines not followed by a line beginning with ' *setAttr' and all the other non-newline characters.

Hence, any other special sequence as tab or whatever else are automatically accpted in the matchings.

ch = '''//Last modified: Sat, Apr 16, 2011 09:55:04 AM
//Codeset: ISO-8859-1
fileInfo "version" "20x64";
createNode newnode -n "a_SET";
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" "blabla";
    setAttr -l on -k on ".test2" -type "string" "blablabla";
createNode newnode -n "b_SET";
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" (
      "hmm bl
      abla\tbla" );
    setAttr -l on -k on ".tes\nt\t2" -type "string" "ehm\tehm";
    setAttr -l on -k on ".test3" -type "string" "too
    much" "pff" """ "feretini" "gol\nolo";
    '''

import re

pat = re.compile('createNode newnode -n ("\w+?_SET");(.*?)(?=createNode|\Z)',re.DOTALL)
pot = re.compile('^ *setAttr.+?'
                 '"((?:\n(?! *setAttr)|[^"\n])+)"'
                 '(?:\n(?! *setAttr)|.)+'
                 '"((?:\n(?! *setAttr)|[^"\n])+)"'
                 '.*;(?:\n|\Z)',re.MULTILINE)

dic = dict( (mat.group(1),dict(pot.findall(mat.group(2)))) for mat in pat.finditer(ch)) 
for x in dic:
    print x,'\n',dic[x],'\n'

result

"b_SET" 
{'.test3': 'gol\nolo', '.test1': 'hmm bl\n      abla\tbla', '.tes\nt\t2': 'ehm\tehm'} 

"a_SET" 
{'.test1': 'blabla', '.test2': 'blablabla'}

回复收藏 0 原文

海夕 2024-11-08 22:03:30

另一个可能的选择：

createNode newnode -n "b_SET";
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" (
      "hmm blablabla" );
    setAttr -l on -k on ".test2" -type "string" "ehmehm";

正如您所看到的 ".test1" 值现在用 /n 行分隔符分隔。您将如何使用 eyquem 的方法来解决这个问题？

pit = re.compile('^ *setAttr.+?("[^"\n]+").+("[^"\n]+");(?:\n|\Z)',re.MULTILINE)

Another possible option:

createNode newnode -n "b_SET";
    addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
    setAttr -l on -k off ".tx";
    setAttr -l on -k off ".ty";
    setAttr -l on -k off ".sz";
    setAttr -l on -k on ".test1" -type "string" (
      "hmm blablabla" );
    setAttr -l on -k on ".test2" -type "string" "ehmehm";

So as you can see ".test1" value is now split with a /n line separator. How would you go around that using eyquem's approach?

pit = re.compile('^ *setAttr.+?("[^"\n]+").+("[^"\n]+");(?:\n|\Z)',re.MULTILINE)

回复收藏 0 原文

~没有更多了~

关于作者

旧时模样

暂无简介

0 文章

0 评论

24 人气

关注发私信

友情链接

文江博客

Python 多行正则表达式 +一次性读取一个文件的多个条目

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

编辑

EDIT

关于作者

相关话题

热门标签

推荐作者

胡图图

zt006

z祗昰~

冰葑

野の

天空

友情链接

Python 多行正则表达式 +一次性读取一个文件的多个条目

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（3）

编辑

EDIT

关于作者

相关话题

热门标签

推荐作者

胡图图

zt006

z祗昰~

冰葑

野の

天空

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。