Using pyparsing to parse a list of regular expressions (literally)

Posted 2024-10-04 05:10:58, 386 characters, 3 views, 0 comments

I'd like to parse a list of regular expressions to calculate how likely each one is to find a match in a given text/string...

E.g., finding '[AB]' in a string of length 1 should be something around 1/13 (considering only capital letters).

Is there a generic regex parser that returns the individual positions/alternatives?
I'm thinking of getting a list of positions in return ('[AB].A{2}' would yield '[['A','B'],'.',['AA']]')
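
To make the idea of a position list concrete, here is a minimal sketch of how such a list could be turned into a probability. It assumes uniformly random capital letters, a random string whose length equals the total pattern length, and a match over the whole string; the function names are made up for illustration.

```python
from fractions import Fraction
import string

ALPHABET = string.ascii_uppercase  # assumption: only capital letters


def position_probability(position, alphabet=ALPHABET):
    """Probability that uniformly random text matches one position.

    A position is '.' (any symbol) or a list of alternative literal
    strings of equal length, e.g. ['A', 'B'] or ['AA'].
    """
    if position == '.':
        return Fraction(1)
    lengths = {len(alt) for alt in position}
    if len(lengths) != 1:
        raise ValueError("alternatives must all have the same length")
    # Distinct literals of equal length are mutually exclusive, so their
    # probabilities simply add up: count / len(alphabet) ** length.
    return Fraction(len(set(position)), len(alphabet) ** lengths.pop())


def match_probability(positions, alphabet=ALPHABET):
    """Multiply the per-position probabilities (positions are independent)."""
    result = Fraction(1)
    for position in positions:
        result *= position_probability(position, alphabet)
    return result


print(match_probability([['A', 'B']]))                 # the '[AB]' case
print(match_probability([['A', 'B'], '.', ['AA']]))    # '[AB].A{2}'
```

Under these assumptions the first call reproduces the 1/13 from the example above. Matching anywhere inside a longer text is a different question, and this sketch does not cover it.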

The problem is the parsing of regular expressions with pyparsing.
Simple regexes are no problem, but when it comes to "alternatives" and repetitions, I'm lost: I find it hard to parse nested expressions like '((A[AB])|(AB))'.
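
For the nested case, one possible approach is a small hand-written recursive-descent parser. Note that this is a swapped-in technique and not a pyparsing grammar; the same recursion could be expressed in pyparsing with a forward-declared expression. The sketch only handles literals, '.', simple [..] classes without ranges, (..) groups, '|' and {n}; the tuple tags ('seq', 'alt', 'class', ...) are an arbitrary choice made for this example.

```python
def parse_regex(pattern):
    """Parse a tiny regex subset into nested tuples (illustrative only)."""
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def take(expected=None):
        nonlocal pos
        char = peek()
        if char is None or (expected is not None and char != expected):
            raise ValueError(f"unexpected {char!r} at index {pos}")
        pos += 1
        return char

    def alternation():
        # alternation := sequence ('|' sequence)*
        branches = [sequence()]
        while peek() == '|':
            take('|')
            branches.append(sequence())
        return branches[0] if len(branches) == 1 else ('alt', branches)

    def sequence():
        # sequence := (atom ['{' n '}'])*
        items = []
        while peek() not in (None, '|', ')'):
            node = atom()
            if peek() == '{':
                node = ('repeat', node, count())
            items.append(node)
        return items[0] if len(items) == 1 else ('seq', items)

    def atom():
        char = take()
        if char == '(':
            node = alternation()  # recursion handles arbitrary nesting
            take(')')
            return node
        if char == '[':
            members = []
            while peek() != ']':
                members.append(take())
            take(']')
            return ('class', members)
        if char == '.':
            return ('any',)
        if char in '*+?]{}':
            raise ValueError(f"unsupported syntax {char!r}")
        return ('lit', char)

    def count():
        take('{')
        digits = ''
        while peek() != '}':
            digits += take()
        take('}')
        return int(digits)

    tree = alternation()
    if pos != len(pattern):
        raise ValueError(f"unexpected {pattern[pos]!r} at index {pos}")
    return tree


print(parse_regex('((A[AB])|(AB))'))
print(parse_regex('[AB].A{2}'))
```

The key point is that a group calls back into the alternation rule, so the nesting depth of the input does not matter.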

Any thoughts?

Comments (2)

乙白 2024-10-11 05:10:58

Simulation rather than calculation may be the way to go.

Set up a population of representative text strings. (Linguists would call such a set a corpus.) For any given regex, find the number of strings it matches, and divide by the total number of strings in your corpus.

Your own example giving the likelihood of '[AB]' as 1/13 is based on this way of thinking, using the corpus of single-capital-letter strings. You got 1/13 by seeing that there are two matches out of the 26 strings in the corpus.

Create a larger corpus: maybe the set of all alphanumeric strings up to a certain length, or all ASCII strings up to a certain length, or the dictionary of your choice. Thinking about what corpus best suits your purpose is a good way to clarify what you mean by "likelihood".
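
A minimal sketch of this counting approach, assuming a corpus of all capital-letter strings up to a given length and using `re.search` so that a match anywhere in the string counts. The helper names `corpus_up_to` and `likelihood` are made up for this example.

```python
import itertools
import re
import string
from fractions import Fraction


def corpus_up_to(length, alphabet=string.ascii_uppercase):
    """Yield every non-empty string over `alphabet` up to `length` symbols."""
    for n in range(1, length + 1):
        for chars in itertools.product(alphabet, repeat=n):
            yield ''.join(chars)


def likelihood(pattern, corpus):
    """Fraction of corpus strings in which `pattern` matches somewhere."""
    regex = re.compile(pattern)
    total = hits = 0
    for text in corpus:
        total += 1
        if regex.search(text):
            hits += 1
    if total == 0:
        raise ValueError("corpus is empty")
    return Fraction(hits, total)


# Single capital letters: 2 matches out of 26 strings.
print(likelihood('[AB]', corpus_up_to(1)))
# A larger corpus gives a different answer for the same regex.
print(likelihood('[AB]', corpus_up_to(2)))
```

The corpus grows exponentially with the length, so for anything beyond short strings a random sample of the corpus is probably the more practical choice.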

江挽川 2024-10-11 05:10:58

Use ['A', 'B'] to mean "either A or B". Then you can write something like this:

'[{'A', ['A', 'B']}, {'A', 'B'}]'

Here [] means "one of these" and {} means "all of these".

1/2 to '{'A', ['A', 'B']}'
   'A' => 1/1
   ['A', 'B'] => 1/2
   (1/1) * (1/2) = 1/2
   this (1/2) times the outer (1/2) = (1/4)
1/2 to '{'A', 'B'}' -> (1/26) to each.
Multiply the two: 1/(26^2), then multiply by the 1/2 = (1/(26^2))/2.

Now multiply both:  (1/4) * ((1/(26^2))/2)

That was a pretty bad explanation... let me try again:

[] => compute the probability: {probability of each term} / {number of terms}
{} => compute the probability of each term and multiply them all

Does that make sense?
