Python: Regex problem / CSV parsing / Psycopg nested arrays
I'm having trouble parsing nested arrays returned by Psycopg2. The DB I'm working on returns records that can have nested arrays as values. Psycopg only parses the outer array of such values.
My first approach was splitting the string on commas, but then I ran into the problem that sometimes a string within the result also contains commas, which renders the entire approach unusable.
My next attempt was using a regex to find the "components" within the string, but then I noticed I wasn't able to detect numbers (since numbers can also occur within strings).
Currently, this is my code:
import re
text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
r = re.compile('\".*?\"|[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}|^\d*[0-9](|.\d*[0-9]|,\d*[0-9])?$')
result = r.search(text)
if result:
    result = result.groups()
The result of this should be:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', 398547, 85.5, -9.2, '62fe6393-00f7-418d-b0b3-7116f6d5cf10']
Since I would like this functionality to be generic, I cannot be certain of the order of the arguments. I only know that the supported types are strings, UUIDs, (signed) integers and (signed) decimals.
Am I using the wrong approach? Or can anyone point me in the right direction?
Thanks in advance!
Comments (5)
Python's built-in csv library should do a good job here. Have you tried it already?
http://docs.python.org/library/csv.html
From your sample, it looks something like
^{(?:(?:([^},"']+|"[^"]+"|'[^']+')(?:,|}))+(?<=})|})$
to me. That's not perfect, since it would allow "{foo,bar}baz}", but it could be fixed if that matters to you.
If you can use assertions, this will get you on the right track.
This problem is too extensive to be solved with a single regex. You are trying to validate and parse at the same time in a global match, but your intended result requires sub-processing after the match. For that reason, it's better to write a simpler global parser and then iterate over the results for validation and fixup (yes, your example does stipulate fixup).
The two main parsing regexes are these:
The first strips the delimiter quotes too, so only $2 contains data; use it in a while loop, in global context:
/(?!}$)(?:^{?|,)\s*("|)(.*?)\1\s*(?=,|}$)/
My preferred one does not strip quotes and only captures $1; it can be used to capture into an array or in a while loop, in global context:
/(?!}$)(?:^{?|,)\s*(".*?"|.*?)\s*(?=,|}$)/
This is an example of post-processing (in Perl) with a documented regex: (edit: fixed appending a trailing ,)
Output:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e','Marc, Dirk en Koen',398547,85.5,-9.2,'62fe6393-00f7-418d-b0b3-7116f6d5cf10']
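For readers working in Python rather than Perl, the "parse globally, then fix up each token" idea described above might be sketched as follows. This is not the answerer's code, only an illustration: the names ELEMENT, convert and parse_array are hypothetical, the braces in the pattern are escaped for readability, and the typing rule (quoted tokens stay strings, everything else is tried as int, then float) is an assumption based on the types listed in the question.

```python
import re

# The second (non-stripping) pattern from above, with the braces escaped.
ELEMENT = re.compile(r'(?!\}$)(?:^\{?|,)\s*(".*?"|.*?)\s*(?=,|\}$)')


def convert(token):
    """Fix up one raw token (hypothetical helper, assumed typing rules)."""
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        return token[1:-1]          # quoted tokens are always strings
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token                    # anything else, e.g. a UUID, stays a string


def parse_array(text):
    """Split the array literal into tokens, then convert each one."""
    return [convert(token) for token in ELEMENT.findall(text)]
```

Called on the sample string from the question, parse_array keeps the two UUIDs and the quoted name as strings and turns the remaining tokens into an int and two floats. Unquoted tokens such as 1e5 or nan would also be read as floats under this rule, which may or may not be what is wanted.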
It seemed that the CSV approach was the easiest to implement:
Thanks for the responses, they were most helpful!
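The code posted with this answer is not preserved here. As a rough idea of what a csv-based approach can look like, here is a minimal sketch using only the standard library. The function names parse_array and convert are hypothetical, and the sketch assumes the value is a single flat array literal wrapped in braces.

```python
import csv


def convert(field):
    """Try int, then float; otherwise keep the string (assumed typing rules)."""
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field


def parse_array(text):
    """Parse a flat '{a,"b, c",1}' style literal with the csv module."""
    inner = text.strip()
    if inner.startswith('{') and inner.endswith('}'):
        inner = inner[1:-1]         # drop the surrounding braces
    # skipinitialspace tolerates the blank after a comma, as in ', 62fe...'
    row = next(csv.reader([inner], skipinitialspace=True), [])
    return [convert(field) for field in row]
```

One trade-off of this sketch: csv.reader removes the quotes before convert sees the field, so a quoted "123" would come back as an int rather than a string.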
Improved upon Dirk's answer. This handles escape characters better, as well as the empty array case. It also needs one fewer strip call:
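The improved code itself is not preserved here either. The two points it mentions, escape characters and the empty array, could be handled along the following lines. This is a sketch under assumptions, not the answerer's version: it assumes that quotes and backslashes inside quoted elements arrive backslash-escaped, the name parse_array is hypothetical, and type conversion is left out to keep the example short.

```python
import csv


def parse_array(text):
    """Split a flat array literal into strings, honouring backslash escapes."""
    inner = text.strip()[1:-1]      # drop the surrounding braces
    if not inner:
        return []                   # '{}' is an empty array
    reader = csv.reader(
        [inner],
        escapechar='\\',            # \" and \\ are taken literally
        doublequote=False,          # a lone quote ends the quoted field
        skipinitialspace=True,      # ignore a blank after the comma
    )
    return next(reader)
```

With these settings an element such as "say \"hi\", ok" comes back as one string containing the inner quotes and the comma.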