Python:正则表达式问题/CSV 解析/Psycopg 嵌套数组

发布于 2024-10-17 08:14:52 字数 833 浏览 5 评论 0原文

我在解析 Psycopg2 返回的嵌套数组时遇到问题。我正在处理的数据库返回可以将嵌套数组作为值的记录。 Psycopg 仅解析此类值的外部数组。

我的第一个方法是用逗号分割字符串,但后来我遇到了一个问题,有时结果中的字符串也包含逗号,这使得整个方法无法使用。 我的下一次尝试是使用正则表达式来查找字符串中的“组件”,但后来我注意到我无法检测数字(因为数字也可以出现在字符串中)。

目前,这是我的代码:

import re
text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
r = re.compile('\".*?\"|[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}|^\d*[0-9](|.\d*[0-9]|,\d*[0-9])?$')
result = r.search(text)
if result:
    result = result.groups()

结果应该是:

['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', 398547, 85.5, -9.2, '62fe6393-00f7-418d-b0b3-7116f6d5cf10']

由于我希望此功能具有通用性,因此我无法确定参数的顺序。我只知道支持的类型是字符串、uuid、(有符号)整数和(有符号)小数。

我使用了错误的方法吗?或者有人能指出我正确的方向吗?

提前致谢!

I'm having trouble parsing nested array's returned by Psycopg2. The DB I'm working on returns records that can have nested array's as value. Psycopg only parses the outer array of such values.

My first approach was splitting the string on comma's, but then I ran into the problem that sometimes a string within the result also contains comma's, which renders the entire approach unusable.
My next attempt was using regex to find the "components" within the string, but then I noticed I wasn't able to detect numbers (since numbers can also occur within strings).

Currently, this is my code:

import re
text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
r = re.compile('\".*?\"|[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}|^\d*[0-9](|.\d*[0-9]|,\d*[0-9])?

The result of this should be:

['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', 398547, 85.5, -9.2, '62fe6393-00f7-418d-b0b3-7116f6d5cf10']

Since I would like to have this functionality generic, I cannot be certain of the order of arguments. I only know that the types that are supported are strings, uuid's, (signed) integers and (signed) decimals.

Am I using a wrong approach? Or can anyone point me in the right direction?

Thanks in advance!

) result = r.search(text) if result: result = result.groups()

The result of this should be:

Since I would like to have this functionality generic, I cannot be certain of the order of arguments. I only know that the types that are supported are strings, uuid's, (signed) integers and (signed) decimals.

Am I using a wrong approach? Or can anyone point me in the right direction?

Thanks in advance!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

孤独难免 2024-10-24 08:14:52

Python 的原生库应该可以很好地工作。你已经尝试过了吗?

http://docs.python.org/library/csv.html

Python's native lib should do a good work. Have you tried it already?

http://docs.python.org/library/csv.html

淡莣 2024-10-24 08:14:52

从您的示例来看,它看起来像 ^{(?:(?:([^},"']+|"[^"]+"|'[^']+')(?:,| }))+(?<=})|})$ 给我。这并不完美,因为它允许“{foo,bar}baz}”,但如果这对您很重要,则可以修复它。

From your sample, it looks something like ^{(?:(?:([^},"']+|"[^"]+"|'[^']+')(?:,|}))+(?<=})|})$ to me. That's not perfect since it would allow "{foo,bar}baz}", but it could be fixed if that matters to you.

对你的占有欲 2024-10-24 08:14:52

如果你能进行断言,这会让你走上正轨。

这个问题太大了,无法在单个正则表达式中完成。您正在尝试在全局匹配中同时验证和解析。但您的预期结果需要在比赛后进行子处理。因此,最好编写一个更简单的全局解析器,然后迭代验证和修复的结果(是的,您的示例中规定了修复)。

两个主要的解析正则表达式是:

  1. 也删除定界符引号,只有 $2 包含数据,在 while 循环中使用,全局上下文
    /(?!}$)(?:^{?|,)\s*("|)(.*?)\1\s*(?=,|}$)/ < /p>

  2. < p>我的首选,不去掉引号,只捕获 $1,可以用于在数组或 while 循环中捕获,全局上下文
    /(?!}$)(?:^{?|,)\s*(".*?"|.*?)\s*(?=,|}$)/< /p>

这个是一个后处理示例(在 Perl 中),带有记录的正则表达式:(编辑:修复附加尾随,

use strict; use warnings;

my $str = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}';

my $rx = qr/ (?!}$) (?:^{?|,) \s* ( ".*?" | .*?) \s* (?=,|}$) /x;

my $rxExpanded = qr/
         (?!}$)           # ASSERT ahead:  NOT a } plus end
         (?:^{?|,)        # Boundry: Start of string plus { OR comma
         \s*              # 0 or more whitespace
         ( ".*?" | .*?)   # Capture "Quoted" or non quoted data
         \s*              # 0 or more whitespace
         (?=,|}$)         # Boundry ASSERT ahead:  Comma OR } plus end
  /x;

my ($newstring, $sucess) = ('[', 0);

for my $field ($str =~ /$rx/g)
{
   my $tmp = $field;
   $sucess = 1;

   if (  $tmp =~ s/^"|"$//g || $tmp =~ /(?:[a-f0-9]+-){3,}/ ) {
      $tmp = "'$tmp'";
   }
   $newstring .= "$tmp,";
}
if ( $sucess ) {
    $newstring =~ s/,$//;
    $newstring .= ']';
    print $newstring,"\n";
}
else {
    print "Invalid string!\n";
}

输出:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb​​45e','Marc, Dirk en Koen',398547,85.5,-9.2,'6
2fe6393-00f7-418d-b0b3-7116f6d5cf10']

If you can do ASSERTIONS, this will get you on the right track.

This problem is too extensive to be done in a single regex. You are trying to validate and parse at the same time in a global match. But your intented result requires sub-processing after the match. For that reason, its better to write a simpler global parser, then itterate over the results for validation and fixup (yes, you have fixup stipulated in your example).

The two main parsing regex's are these:

  1. strips delimeter quote too and only $2 contains data, use in a while loop, global context
    /(?!}$)(?:^{?|,)\s*("|)(.*?)\1\s*(?=,|}$)/

  2. my preferred one, does not strip quotes, only captures $1, can use to capture in an array or in a while loop, global context
    /(?!}$)(?:^{?|,)\s*(".*?"|.*?)\s*(?=,|}$)/

This is an example of post processing (in Perl) with a documented regex: (edit: fix append trailing ,)

use strict; use warnings;

my $str = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}';

my $rx = qr/ (?!}$) (?:^{?|,) \s* ( ".*?" | .*?) \s* (?=,|}$) /x;

my $rxExpanded = qr/
         (?!}$)           # ASSERT ahead:  NOT a } plus end
         (?:^{?|,)        # Boundry: Start of string plus { OR comma
         \s*              # 0 or more whitespace
         ( ".*?" | .*?)   # Capture "Quoted" or non quoted data
         \s*              # 0 or more whitespace
         (?=,|}$)         # Boundry ASSERT ahead:  Comma OR } plus end
  /x;

my ($newstring, $sucess) = ('[', 0);

for my $field ($str =~ /$rx/g)
{
   my $tmp = $field;
   $sucess = 1;

   if (  $tmp =~ s/^"|"$//g || $tmp =~ /(?:[a-f0-9]+-){3,}/ ) {
      $tmp = "'$tmp'";
   }
   $newstring .= "$tmp,";
}
if ( $sucess ) {
    $newstring =~ s/,$//;
    $newstring .= ']';
    print $newstring,"\n";
}
else {
    print "Invalid string!\n";
}

Output:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e','Marc, Dirk en Koen',398547,85.5,-9.2,'6
2fe6393-00f7-418d-b0b3-7116f6d5cf10']

最美的太阳 2024-10-24 08:14:52

看来 CSV 方法是最容易实现的:

def parsePsycopgSQLArray(input):
    import csv
    import cStringIO

    input = input.strip("{")
    input = input.strip("}")

    buffer = cStringIO.StringIO(input)
    reader = csv.reader(buffer, delimiter=',', quotechar='"')   

    return reader.next() #There can only be one row 

if __name__ == "__main__":
    text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}' 
    result = parsePsycopgSQLArray(text)
    print result

感谢您的回复,他们非常有帮助!

It seemed that the CSV approach was the easiest to implement:

def parsePsycopgSQLArray(input):
    import csv
    import cStringIO

    input = input.strip("{")
    input = input.strip("}")

    buffer = cStringIO.StringIO(input)
    reader = csv.reader(buffer, delimiter=',', quotechar='"')   

    return reader.next() #There can only be one row 

if __name__ == "__main__":
    text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}' 
    result = parsePsycopgSQLArray(text)
    print result

Thanks for the responses, they were most helpfull!

肩上的翅膀 2024-10-24 08:14:52

改进了德克的答案。这可以更好地处理转义字符以及空数组的情况。还少了一次剥离调用:

def restore_str_array(val):
    """
    Converts a postgres formatted string array (as a string) to python

    :param val: postgres string array
    :return: python array with values as strings
    """
    val = val.strip("{}")
    if not val:
        return []
    reader = csv.reader(StringIO(val), delimiter=',', quotechar='"', escapechar='\\')
    return reader.next()

Improved upon Dirk's answer. This handles escape characters better as well as the empty array case. One less strip call as well:

def restore_str_array(val):
    """
    Converts a postgres formatted string array (as a string) to python

    :param val: postgres string array
    :return: python array with values as strings
    """
    val = val.strip("{}")
    if not val:
        return []
    reader = csv.reader(StringIO(val), delimiter=',', quotechar='"', escapechar='\\')
    return reader.next()
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文