I am working with the Amazon Mechanical Turk API and it will only allow me to use regular expressions to filter a field of data.
I would like to input an integer range to a function, such as 256-311 or 45-1233, and return a regex that would match only that range.
A regex matching 256-321 would be:
\b((25[6-9])|(2[6-9][0-9])|(3[0-1][0-9])|(32[0-1]))\b
That part is fairly easy, but I am having trouble with the loop to create this regex.
I am trying to build a function defined like this:
function getRangeRegex( int fromInt, int toInt)
{
return regexString;
}
I looked all over the web and I am surprised that it doesn't look like anyone has solved this in the past. It is a difficult problem...
Thanks for your time.
Here's a quick hack:
which produces:
Not properly tested, use at your own risk!
And yes, the generated regex could be written more compactly in many cases, but I leave that as an exercise for the reader :)
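For comparison, here is a minimal Python sketch of one way to build such a regex. It is not the quick hack referred to above. It assumes non-negative integers only, and the names get_range_regex, _split_points and the other helpers are hypothetical. The idea is to cut the range at boundaries such as 259, 299 and 319, so that every piece can be written digit by digit.

```python
import re


def _fill_with_nines(n, count):
    """Replace the last `count` digits of n with 9s, e.g. (256, 1) -> 259."""
    return int(str(n)[:-count] + "9" * count)


def _fill_with_zeros(n, count):
    """Replace the last `count` digits of n with 0s, e.g. (322, 1) -> 320."""
    return n - n % 10 ** count


def _split_points(lo, hi):
    """Upper ends of the sub-ranges that each fit one digit-by-digit pattern."""
    stops = {hi}
    # Walk up from lo: 256 -> 259 -> 299 -> ...
    count = 1
    stop = _fill_with_nines(lo, count)
    while lo <= stop < hi:
        stops.add(stop)
        count += 1
        stop = _fill_with_nines(lo, count)
    # Walk down from hi: 321 -> 319 -> 299 -> ...
    count = 1
    stop = _fill_with_zeros(hi + 1, count) - 1
    while lo < stop <= hi:
        stops.add(stop)
        count += 1
        stop = _fill_with_zeros(hi + 1, count) - 1
    return sorted(stops)


def _digit_pattern(start, stop):
    """Pattern for a sub-range whose ends have the same number of digits."""
    parts = []
    for a, b in zip(str(start), str(stop)):
        parts.append(a if a == b else f"[{a}-{b}]")
    return "".join(parts)


def get_range_regex(from_int, to_int):
    """Return a regex string matching exactly the integers from_int..to_int."""
    if not 0 <= from_int <= to_int:
        raise ValueError("expected 0 <= from_int <= to_int")
    alternatives = []
    start = from_int
    for stop in _split_points(from_int, to_int):
        alternatives.append(_digit_pattern(start, stop))
        start = stop + 1
    return r"\b(?:" + "|".join(alternatives) + r")\b"


print(get_range_regex(256, 321))
# \b(?:25[6-9]|2[6-9][0-9]|3[0-1][0-9]|32[0-1])\b
print(bool(re.search(get_range_regex(45, 1233), "id 1233 ok")))  # True
```

The output is not minimal (100-199 becomes two alternatives, for instance), but each alternative covers a disjoint slice of the range, which keeps the result easy to check by brute force.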
For anyone else who, like me, was looking for a JavaScript version of @Bart Kiers's great answer above:
PHP Port of RegexNumericRangeGenerator
Test Results
Is there a reason it has to be a regex? Can't you do something like this:
Update:
There is an easy but ugly way to do it using range:
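The snippets for this answer are not shown here. One reading of the range-based approach, assuming it means enumerating every integer in the range and joining them with |, can be sketched in Python as follows. The name enumerated_range_regex is hypothetical.

```python
import re


def enumerated_range_regex(lo, hi):
    """Brute force: list every integer in lo..hi as its own alternative."""
    alternatives = "|".join(str(n) for n in range(lo, hi + 1))
    return rf"\b(?:{alternatives})\b"


pattern = re.compile(enumerated_range_regex(256, 321))
print(bool(pattern.fullmatch("300")))  # True
print(bool(pattern.fullmatch("322")))  # False
```

This is trivially correct, but the regex grows linearly with the size of the range, so it is only practical for small ranges.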
Be careful: the excellent code by @Bart Kiers (and Travis J's JS version) fails in some cases. For example:
does not match "1229", "1115", "1[0-2][0-2][5-9]"
That actually has been done already.
Have a look at this site. It contains a link to a Python script that generates these regexes for you automagically.
This answer is duplicated from this question. I've also made it into a blog post
Using regular expressions to validate a numeric range
To be clear: when a simple if statement will suffice, using regular expressions for validating numeric ranges is not recommended.
In addition, since regular expressions analyze strings, numbers must first be translated to a string before they can be tested (an exception is when the number happens to already be a string, such as when getting user input from the console).
(To ensure the string is a number to begin with, you could use org.apache.commons.lang3.math.NumberUtils#isNumber(s).)
Despite this, figuring out how to validate number ranges with regular expressions is interesting and instructive.
A one number range
Rule: A number must be exactly 15.
The simplest range there is. A regex to match this is
Word boundaries are necessary to avoid matching the 15 inside of 8215242.
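The regex for this rule is not shown above. A minimal Python sketch of the word-boundary point, assuming the pattern in question is 15 wrapped in \b on both sides (the name EXACTLY_15 is hypothetical):

```python
import re

# Assumed pattern: the literal 15 with a word boundary on each side.
EXACTLY_15 = re.compile(r"\b15\b")

print(bool(EXACTLY_15.search("15")))        # True
print(bool(EXACTLY_15.search("8215242")))   # False: the 15 is inside a longer number
print(bool(re.search(r"15", "8215242")))    # True: without boundaries it matches
print(bool(EXACTLY_15.search("-15")))       # True: \b alone does not reject a minus sign
```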
A two number range
The rule: The number must be between 15 and 16. Three possible regexes:
A number range "mirrored" around zero
The rule: The number must be between -12 and 12.
Here is a regex for 0 through 12, positive-only:
Free-spaced:
Making this work for both negative and positive is as simple as adding an optional dash at the start:
(This assumes no inappropriate characters precede the dash.)
To forbid negative numbers, a negative lookbehind is necessary:
Leaving the lookbehind out would cause the 11 in -11 to match. (The first example in this post should have this added.)
Note: \d versus [0-9]
In order to be compatible with all regex flavors, all \d-s should be changed to [0-9]. For example, .NET considers non-ASCII numbers, such as those in different languages, as legal values for \d. Except for in the last example, for brevity, it's left as \d.
(With thanks to TimPietzcker at stackoverflow)
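The regexes for the mirrored range are not shown above. The three variants it describes (positive-only, optional dash, and negative lookbehind) can be sketched in Python as follows. The patterns and names are assumptions written for this illustration, using [0-9] rather than \d as the note suggests.

```python
import re

# Assumed building block: 0 through 9, or 10 through 12.
DIGITS_0_TO_12 = r"(?:[0-9]|1[0-2])"

POSITIVE_ONLY = re.compile(rf"\b{DIGITS_0_TO_12}\b")
SIGNED = re.compile(rf"-?\b{DIGITS_0_TO_12}\b")            # optional leading dash
NO_NEGATIVES = re.compile(rf"(?<!-)\b{DIGITS_0_TO_12}\b")  # reject a preceding dash

print(bool(SIGNED.fullmatch("-12")))        # True
print(bool(SIGNED.fullmatch("-13")))        # False
print(bool(POSITIVE_ONLY.search("-11")))    # True: the 11 inside -11 still matches
print(bool(NO_NEGATIVES.search("-11")))     # False: the lookbehind blocks it
print(bool(NO_NEGATIVES.search("11")))      # True
```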
Three digits, with all but the first digit equal to zero
Rule: Must be between 0 and 400.
A possible regex:
Free spaced:
Another possibility that should never be used:
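The regexes for this section are not shown above. As a minimal sketch of a pattern that satisfies the 0-to-400 rule (not the "never be used" variant), here is a free-spaced version in Python using re.VERBOSE. The pattern and the name ZERO_TO_400 are assumptions.

```python
import re

ZERO_TO_400 = re.compile(r"""
    \b(?:
        [0-9]{1,2}        # 0 through 99 (also accepts a leading zero, e.g. 07)
      | [1-3][0-9]{2}     # 100 through 399
      | 400               # 400 itself
    )\b
""", re.VERBOSE)

print(bool(ZERO_TO_400.fullmatch("400")))  # True
print(bool(ZERO_TO_400.fullmatch("401")))  # False
```

Treating 400 as its own alternative is what keeps the upper bound exact: the [1-3] branch stops at 399.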
Final example: Four digits, mirrored around zero, that does not end with zeros.
Rule: Must be between -2055 and 2055.
This is from a question on stackoverflow.
Regex:
Free-spaced:
Here is a visual representation of this regex:
And here you can try it out yourself: Debuggex demonstration
(With thanks to PlasmaPower on stackoverflow for the debugging assistance.)
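The regex for this final example is not shown above. A hedged Python sketch of one pattern that satisfies the -2055 to 2055 rule is below. It is an assumption written for illustration, not the regex from the linked question, and the name WITHIN_2055 is hypothetical.

```python
import re

WITHIN_2055 = re.compile(r"""
    -?                    # optional minus sign
    \b(?:
        [0-9]{1,3}        # 0 through 999 (leading zeros are accepted)
      | 1[0-9]{3}         # 1000 through 1999
      | 20[0-4][0-9]      # 2000 through 2049
      | 205[0-5]          # 2050 through 2055
    )\b
""", re.VERBOSE)

print(bool(WITHIN_2055.fullmatch("-2055")))  # True
print(bool(WITHIN_2055.fullmatch("2056")))   # False
```

Because the upper bound does not end in zeros, the last thousand has to be split three ways (20[0-4][0-9], then 205[0-5]) instead of being handled by a single literal as 400 was in the previous example.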
Final note
Depending on what you are capturing, it is likely that all sub-groups should be made into non-capture groups. For example, this:
Instead of this:
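The two patterns being compared are not shown above. The difference between capturing and non-capturing groups can be sketched in Python with a small assumed pattern for 15 or 16:

```python
import re

# Hypothetical patterns for "15 or 16", with and without capture groups.
capturing = re.compile(r"\b(1(5|6))\b")
non_capturing = re.compile(r"\b(?:1(?:5|6))\b")

print(capturing.groups)                      # 2
print(non_capturing.groups)                  # 0
print(capturing.search("16").groups())       # ('16', '6')
print(non_capturing.search("16").groups())   # ()
print(non_capturing.search("16").group(0))   # 16
```

Both patterns accept the same strings. The non-capturing form simply avoids recording sub-matches that nothing will read.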
Example Java implementation
Output
I've converted Bart Kiers's answer into C++. The function takes two integers as an input and generates the regular expression for the number range.
As I have encountered the same issue that @EmilianoT already reported, I tried to fix it, but in the end I opted for porting the PHP port of RegexNumericRangeGenerator (ported by @EmilianoT), although not in a class. I am not quite happy with this JS port, as all the toString() and parseInt() calls could still be optimised (some of them may be unnecessary), but it works for all cases.
One thing I changed is the parameters. I replaced parse($min, $max, $MatchWholeWord = FALSE, $MatchWholeLine = FALSE, $MatchLeadingZero = FALSE) with parse(min, max, width = 0, prefix = '', suffix = ''), which gives it more options (some might want to put the regex into slashes, others want to match the whole line [prefix = '^'; suffix = '$'], etc). I also wanted to be able to configure the width of the number (width = 3 → 000, 001, 052, 800, 1000, ...).
I replaced my previous answer, as it does not work all the time. If one wants to read it, they can see it in the answer history.