Python regex: capturing a section of multiple strings that contain spaces

Posted 2024-10-19 23:21:50

I am trying to capture sub-strings from a string that looks similar to

'some string, another string, '

I want the result match group to be

('some string', 'another string')

my current solution

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

works, but is not practical. What I am showing here is of course massively reduced in complexity compared to what I'm doing in the real project; I want to use only one 'straight' (non-computed) regex pattern. Unfortunately, my attempts have failed so far:

This doesn't match (None as result), because {2} is applied to the space only, not to the whole string:

>>> match('.*?, {2}', 'some string, another string, ')

Adding parentheses around the repeated string leaves the comma and space in the result:

>>> match('(.*?, ){2}', 'some string, another string, ').groups()
('another string, ',)

Adding another set of parentheses does fix that, but gets me too much:

>>> match('((.*?), ){2}', 'some string, another string, ').groups()
('another string, ', 'another string')

Adding a non-capturing modifier improves the result, but still misses the first string:

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)

I feel like I'm close, but I can't really seem to find the proper way.

Can anyone help me? Are there any other approaches I'm not seeing?


Update after the first few responses:

First up, thank you very much everyone, your help is greatly appreciated! :-)

As I said in the original post, I have omitted a lot of complexity in my question for the sake of depicting the actual core problem. For starters, in the project I am working on, I am parsing large amounts of files (currently tens of thousands per day) in a number (currently 5, soon ~25, possibly in the hundreds later) of different line-based formats. There is also XML, JSON, binary and some other data file formats, but let's stay focussed.

In order to cope with the multitude of file formats and to exploit the fact that many of them are line-based, I have created a somewhat generic Python module that loads one file after the other, applies a regex to every line and returns a large data structure with the matches. This module is a prototype; the production version will require a C++ version for performance reasons, which will be connected over Boost::Python and will probably add the subject of regex dialects to the list of complexities.

Also, there are not 2 repetitions, but an amount varying between currently zero and 70 (or so), the comma is not always a comma and despite what I said originally, some parts of the regex pattern will have to be computed at runtime; let's just say I have reason to try and reduce the 'dynamic' amount and have as much 'fixed' pattern as possible.

So, in a word: I must use regular expressions.


Attempt to rephrase: I think the core of the problem boils down to: Is there a Python regex notation that, for example, involves curly-brace repetition and allows me to capture

'some string, another string, '

into

('some string', 'another string')

?

Hmmm, that probably narrows it down too far - but then, any way you do it is wrong :-D


Second attempt to rephrase: Why do I not see the first string ('some string') in the result? Why does the regex produce a match (indicating there's gotta be 2 of something), but only return 1 string (the second one)?

The problem remains the same even if I use non-numeric repetition, i.e. using + instead of {2}:

>>> match('(?:(.*?), )+', 'some string, another string, ').groups()
('another string',)

Also, it's not the second string that's returned, it is the last one:

>>> match('(?:(.*?), )+', 'some string, another string, third string, ').groups()
('third string',)

Again, thanks for your help, never ceases to amaze me how helpful peer review is while trying to find out what I actually want to know...


自由范儿 2024-10-26 23:21:50


Unless there's much more to this problem than you've explained, I don't see the point in using regexes. This is very simple to deal with using basic string methods:

[s.strip() for s in mys.split(',') if s.strip()]

Or if it has to be a tuple:

tuple(s.strip() for s in mys.split(',') if s.strip())

The code is more readable too. Please tell me if this fails to apply.
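
As a small sketch of the same idea wrapped in a function (the name split_fields and the sep parameter are made up for illustration, they are not from the answer above), something like this could be used when the separator is not always a comma:

```python
def split_fields(line, sep=","):
    """Split a delimited line and drop empty or whitespace-only fields."""
    # str.split keeps the text between separators; strip() trims the
    # surrounding spaces, and the filter drops the empty trailing piece.
    return tuple(part.strip() for part in line.split(sep) if part.strip())


print(split_fields("some string, another string, "))
# ('some string', 'another string')
```

This only covers the reduced example from the question; it does not help if the delimited list is embedded in a longer line that needs a regex anyway.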


EDIT: Ok, there is indeed more to this problem than it initially seemed. Leaving this for historical purposes though. (Guess I'm not 'disciplined' :) )

暗地喜欢 2024-10-26 23:21:50


As described, I think this regex works fine:

import re
thepattern = re.compile("(.+?)(?:,|$)") # lazy non-empty match 
thepattern.findall("a, b, asdf, d")     # until comma or end of line
# Result:
Out[19]: ['a', ' b', ' asdf', ' d']

The key here is to use findall rather than match. The phrasing of your question suggests you prefer match, but it isn't the right tool for the job here -- it is designed to return exactly one string for each corresponding group ( ) in the regex. Since your 'number of strings' is variable, the right approach is to use either findall or split.
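
To illustrate both options, here is a rough sketch (the names FIELD, find_fields and split_on_commas are assumptions for this example, and the pattern is a variation on the one above that also trims surrounding whitespace):

```python
import re

# One field: starts with a character that is neither a comma nor whitespace,
# continues lazily up to optional whitespace followed by a comma or the end.
FIELD = re.compile(r"\s*([^,\s][^,]*?)\s*(?:,|$)")


def find_fields(line):
    """Return every comma-separated field, however many there are."""
    return FIELD.findall(line)


def split_on_commas(line):
    """Same result via re.split, dropping the empty trailing piece."""
    return [part for part in re.split(r"\s*,\s*", line.strip()) if part]


print(find_fields("a, b, asdf, d"))
print(find_fields("some string, another string, "))
print(split_on_commas("some string, another string, "))
```

Either function returns a list whose length follows the input, which is the point of preferring findall or split over a fixed number of groups in match.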

If this isn't what you need, then please make the question more specific.

Edit: And if you must use tuples rather than lists:

tuple(Out[19])
# Result
Out[20]: ('a', ' b', ' asdf', ' d')
£冰雨忧蓝° 2024-10-26 23:21:50
import re

# Two explicit groups; each group matches runs of non-space, non-comma
# characters joined by inner spaces, so outer whitespace is not captured.
regex = r" *((?:[^, ]| +[^, ])+) *, *((?:[^, ]| +[^, ])+) *, *"

print(re.match(regex, 'some string, another string, ').groups())
# ('some string', 'another string')
print(re.match(regex, ' some string, another string, ').groups())
# ('some string', 'another string')
print(re.match(regex, ' some string , another string, ').groups())
# ('some string', 'another string')
默嘫て 2024-10-26 23:21:50


No offense, but you obviously have a lot to learn about regexes, and what you're going to learn, ultimately, is that regexes can't handle this job. I'm sure this particular task is doable with regexes, but then what? You say you have potentially hundreds of different file formats to parse! You even mentioned JSON and XML, which are fundamentally incompatible with regexes.

Do yourself a favor: forget about regexes and learn pyparsing instead. Or skip Python entirely and use a standalone parser generator like ANTLR. In either case, you'll probably find that grammars for most of your file formats have already been written.

心房敞 2024-10-26 23:21:50


"I think the core of the problem boils down to: Is there a Python RegEx notation that e.g. involves curly braces repetitions and allows me to capture 'some string, another string, ' ?"

I don't think there is such a notation.

But regexes are not only a matter of NOTATION, that is to say the RE string used to define a regex. They are also a matter of TOOLS, that is to say functions.

"Unfortunately, I can't use findall as the string from the initial question is only a part of the problem, the real string is a lot longer, so findall only works if I do multiple regex findalls / matches / searches."

You should give more information up front, so that we can understand the constraints sooner. In my opinion, for the problem as you have stated it, findall() is indeed fine:

import re

for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):

    print(re.findall('(.+?), *', line))

Result

['string one', 'string two']
['some string', 'another string', 'third string']
['Topaz', 'Turquoise', 'Moss Agate', 'Obsidian', 'Tigers-Eye', 'Tourmaline', 'Lapis Lazuli']

Now, since you "have omitted a lot of complexity" in your question, findall() may turn out to be insufficient to handle that complexity. In that case finditer() can be used instead, because it allows more flexibility in selecting the groups of a match:

import re

for line in ('string one, string two, ',
             'some string, another string, third string, ',
             # the following two lines are only one string
             'Topaz, Turquoise, Moss Agate, Obsidian, '
             'Tigers-Eye, Tourmaline, Lapis Lazuli, '):

    print([mat.group(1) for mat in re.finditer('(.+?), *', line)])

This gives the same result, and it can be extended by writing another expression in place of mat.group(1).
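
As one possible illustration of that flexibility (a sketch only; ITEM and items_with_spans are names invented for this example), each match object from finditer() also exposes where the captured text sits in the line:

```python
import re

ITEM = re.compile(r"(.+?), *")


def items_with_spans(line):
    """Return (text, start, end) for every captured item in the line."""
    return [(m.group(1), m.start(1), m.end(1)) for m in ITEM.finditer(line)]


line = "some string, another string, third string, "
for text, start, end in items_with_spans(line):
    # The slice of the original line always equals the captured text.
    print(text, start, end, line[start:end] == text)
```

The positions could be useful if the items have to be related back to other parts of a longer line, which findall() alone does not report.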

清君侧 2024-10-26 23:21:50

为了总结这一点,我似乎已经通过以“动态”方式构建正则表达式模式来使用最佳解决方案:

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

2 * '(.*?)

就是我所说的动态。 替代方法

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)

由于以下事实,

无法返回所需的结果(正如 Glenn 和 Alan 善意地解释的那样)

如果匹配,捕获的内容将被覆盖
每次重复捕获

谢谢大家的帮助! :-)

In order to sum this up, it seems I am already using the best solution by constructing the regex pattern in a 'dynamic' manner:

>>> from re import match
>>> match(2 * '(.*?), ', 'some string, another string, ').groups()
('some string', 'another string')

the 2 * '(.*?) part is what I mean by dynamic. The alternative approach

>>> match('(?:(.*?), ){2}', 'some string, another string, ').groups()
('another string',)

fails to return the desired result due to the fact that (as Glenn and Alan kindly explained)

"with match, the captured content gets overwritten with each repetition of the capturing group"
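
A minimal sketch of both behaviours side by side, assuming a helper called build_pattern (a made-up name, with item and sep as illustrative parameters) that repeats the group in the pattern text so every item gets its own group:

```python
import re

line = "some string, another string, "

# One group inside a repetition: only the last capture survives.
repeated = re.match(r"(?:(.*?), ){2}", line)
print(repeated.groups())  # ('another string',)


def build_pattern(count, item=r"(.*?)", sep=", "):
    """Repeat the item/separator pair so each item has its own group."""
    # re.escape keeps separators such as '|' or '.' literal.
    return (item + re.escape(sep)) * count


computed = re.match(build_pattern(2), line)
print(computed.groups())  # ('some string', 'another string')
```

The count and separator still have to be known when the pattern is built, so this keeps the 'dynamic' part small rather than removing it.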

Thanks for your help everyone! :-)
