Parsing a list of regular expressions with pyparsing (literally)
I'd like to parse a list of regular expressions to calculate, for each one, the likelihood of finding a match in a given text/string.
E.g. finding '[AB]' in a string of length 1 should come out at around 1/13 (considering only capital letters).
Is there a generic regex parser that returns the individual positions/alternatives?
I'm thinking of getting a list of positions back: '[AB].A{2}' would yield [['A','B'], '.', ['AA']].
The problem is parsing the regular expressions with pyparsing.
Simple regexes are no problem, but when it comes to alternatives and repetitions, I'm lost: I find it hard to parse nested expressions like '((A[AB])|(AB))'.
Any thoughts?
Comments (2)
Simulation rather than calculation may be the way to go.
Set up a population of representative text strings. (Linguists would call such a set a corpus.) For any given regex, find the number of strings it matches, and divide by the total number of strings in your corpus.
Your own example giving the likelihood of '[AB]' as 1/13 is based on this way of thinking, using the corpus of single-capital-letter strings. You got 1/13 by seeing that there are two matches out of the 26 strings in the corpus.
Create a larger corpus: maybe the set of all alphanumeric strings up to a certain length, or all ASCII strings up to a certain length, or the dictionary of your choice. Thinking about what corpus best suits your purpose is a good way to clarify what you mean by "likelihood".
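The corpus idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus is every string of one fixed length over the uppercase alphabet, a "match" means the standard library's `re.search` finds the pattern somewhere in the string, and the helper name `match_fraction` is made up for this example.

```python
import itertools
import re
import string


def match_fraction(pattern, alphabet=string.ascii_uppercase, length=1):
    """Fraction of all strings of `length` over `alphabet` in which `pattern` is found.

    Hypothetical helper: the corpus (all fixed-length uppercase strings)
    is an assumption, not something prescribed by the answer.
    """
    regex = re.compile(pattern)
    total = 0
    hits = 0
    # Enumerate the whole corpus; this grows as len(alphabet) ** length,
    # so it is only practical for short strings.
    for chars in itertools.product(alphabet, repeat=length):
        total += 1
        if regex.search("".join(chars)):
            hits += 1
    return hits / total


# The question's own example: 2 of the 26 single-letter strings match.
print(match_fraction("[AB]"))
# The nested expression from the question, over all two-letter strings.
print(match_fraction("((A[AB])|(AB))", length=2))
```

For longer strings or a bigger alphabet, drawing random samples from the corpus instead of enumerating it gives an estimate of the same fraction without parsing the regex at all.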
You use ['A', 'B'] to say: either A or B. Then you can write something like this:
Here [] means "one of these", just as {} means "all of these".
That was a pretty bad explanation... I'll try again...
Does that make sense?
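One possible reading of the "one of these" / "all of these" idea, as a rough sketch rather than the answerer's own code: a Python list stands for alternatives, and a tuple stands in for the {} "all of these" grouping, because Python's braces would build an unordered set and the order of a sequence matters. The function name `expand` and the hand-written structure are assumptions made for this example.

```python
from itertools import product


def expand(node):
    """Return every string a nested structure can produce.

    str   -> a literal
    list  -> "one of these" (alternatives)
    tuple -> "all of these", in order (a sequence)
    """
    if isinstance(node, str):
        return [node]
    if isinstance(node, list):
        # Collect the expansions of each alternative.
        return [s for alt in node for s in expand(alt)]
    if isinstance(node, tuple):
        # Combine one expansion from each child, left to right.
        parts = [expand(child) for child in node]
        return ["".join(combo) for combo in product(*parts)]
    raise TypeError(f"unsupported node: {node!r}")


# '((A[AB])|(AB))' from the question, written out by hand.
nested = [("A", ["A", "B"]), ("A", "B")]
distinct = set(expand(nested))
print(sorted(distinct))
# Share of all two-letter uppercase strings that the expression matches in full.
print(len(distinct) / 26 ** 2)
```

A parser, whether written with pyparsing or by hand, would only need to emit this kind of nested structure; the counting step stays the same.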