Regex, encoding, and characters that look alike
First, a brief example: say I have the regex /[0-9]{2}°/ and the text "24º". The text won't match, obviously... (?) Really, it depends on the font: the pattern contains the degree sign ° and the text contains the masculine ordinal indicator º, which many fonts draw almost identically.
Here is my problem: I have no control over which characters the user types, so I need to cover all possibilities in the regex, as in /[0-9]{2}[°º]/, or, even better, ensure that the text contains only the characters I'm expecting (°).
But I can't just remove the unknown characters, otherwise the regex won't work; I need to replace each one with the expected character it looks like. I have done this with a little function that maps "looks like" to "what I expect" and substitutes it. The problem is that I have not covered all possibilities. For example, today I found a new -, so now we have three of them, just like LaTeX =D
-
--
---
Cool, but the regex didn't work.
Does anyone know how I might solve this?
Comments (5)
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common confusions, but you have no guarantee. In the case of the degree sign, the Unicode code chart lists four similar characters (\u02da ring above, \u030a combining ring above, \u2070 superscript zero and \u2218 ring operator), but not your problematic character, the masculine ordinal indicator (\u00ba).
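To make the "list them yourself" approach concrete, here is a minimal sketch in Python (the language of the `unicodedata` reference linked further down; the same idea carries over to a PHP character class). The names `DEGREE_LIKE`, `TEMPERATURE` and `has_temperature` are made up for illustration, and the list of look-alikes is only the handful mentioned in this thread, not an exhaustive set.

```python
import re

# The expected character plus the look-alikes named in this thread.
# Illustrative only: nothing guarantees this list is complete.
DEGREE_LIKE = (
    "\u00b0"  # DEGREE SIGN (the character we actually expect)
    "\u00ba"  # MASCULINE ORDINAL INDICATOR (the one from the question)
    "\u02da"  # RING ABOVE
    "\u030a"  # COMBINING RING ABOVE
    "\u2070"  # SUPERSCRIPT ZERO
    "\u2218"  # RING OPERATOR
)

# Two digits followed by any one of the characters listed above.
TEMPERATURE = re.compile("[0-9]{2}[" + DEGREE_LIKE + "]")


def has_temperature(text: str) -> bool:
    """Return True if the text contains two digits and a degree-like sign."""
    return TEMPERATURE.search(text) is not None
```

Writing the characters as escapes rather than literals keeps the pattern readable even when the editor's font renders them identically.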
Unfortunately not in PHP. ASP.NET has Unicode character classes that cover things like this, but as you can see here, the So (Symbol, Other) category covers too much. Also, since it's not PHP, it doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common characters and listing them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also dedicated symbols for degree Fahrenheit and degree Celsius. There are tons of minus signs, unfortunately.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP:
"/[0-9]{2}[°º]/u"
Then you can include all the Unicode characters that you want to accept in your character class. You will also need to convert the subject string to UTF-8 before using the regex on it.
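The "convert first, then match" step might look like the following Python sketch. In Python 3 a `str` pattern already works on Unicode code points, so the part that corresponds to the advice above is decoding the raw bytes before matching. The helper names `to_text` and `matches` are hypothetical, and the Latin-1 fallback is an assumption about where the input comes from, not something the thread specifies.

```python
import re

# Degree sign or masculine ordinal indicator after two digits.
PATTERN = re.compile("[0-9]{2}[\u00b0\u00ba]")


def to_text(raw: bytes) -> str:
    """Decode user input to text before any matching is attempted."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: treat undecodable input as Latin-1,
        # where 0xB0 is the degree sign and 0xBA the ordinal indicator.
        return raw.decode("latin-1")


def matches(raw: bytes) -> bool:
    """Return True if the decoded input contains a temperature."""
    return PATTERN.search(to_text(raw)) is not None
```

Matching on the undecoded bytes would instead compare individual bytes, so a two-byte UTF-8 degree sign would never equal a single character in the class.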
I just stumbled upon some good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
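As a rough illustration of how `unicodedata.normalize` from the second link could fit in, here is a sketch that combines it with the hand-written "looks like" map described in the question. Everything named here (`LOOKALIKES`, `canonicalize`) is hypothetical. One caveat worth seeing in code: NFKC does not map the masculine ordinal indicator to the degree sign (it folds it to a plain "o"), so the explicit map has to run first.

```python
import unicodedata

# Hand-maintained map of look-alikes to the expected character.
# Illustrative and incomplete by nature.
LOOKALIKES = {
    "\u00ba": "\u00b0",  # masculine ordinal indicator -> degree sign
    "\u02da": "\u00b0",  # ring above -> degree sign
    "\u2218": "\u00b0",  # ring operator -> degree sign
    "\u2212": "-",       # minus sign (a math symbol, not a dash)
}


def canonicalize(text: str) -> str:
    """Replace known look-alikes, fold all dashes, then apply NFKC."""
    out = []
    for ch in text:
        if ch in LOOKALIKES:
            out.append(LOOKALIKES[ch])
        elif unicodedata.category(ch) == "Pd":
            # Pd is "Punctuation, dash": hyphen, en dash, em dash, ...
            out.append("-")
        else:
            out.append(ch)
    # NFKC folds compatibility forms such as fullwidth digits
    # and the single-character degree Celsius sign.
    return unicodedata.normalize("NFKC", "".join(out))
```

Checking the general category catches dash variants that were never added to the map, which addresses the "today I found a new -" problem to some extent. It also folds every dash into one character, which may be too aggressive if the text needs to keep en and em dashes distinct.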
OK, if you're looking to pull temperatures, you'll probably need to start by changing a few things first.
Temperatures can come in 1 to 3 digits, so
[0-9]{1,3}
(and if someone is actually still alive to put in a four-digit temperature, then we are all doomed!) may be more accurate for you. Now, the degree signs are the tricky part, as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
You might have to beef up the first part, though, with a little position handling, such as anchoring at the beginning or end of the string.
You may also exclude all the regular characters you don't want.
That will pick up all the punctuation marks (only one though).
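The patterns this answer originally showed are not preserved here, so the following Python sketch is only one possible reading of "pull whatever comes next" while excluding regular characters. The pattern `TEMP` and the function `extract` are assumptions made for illustration.

```python
import re

# One to three digits that do not continue a longer number, followed by
# exactly one character that is not an ASCII letter, digit, or whitespace.
TEMP = re.compile(r"(?<![0-9])([0-9]{1,3})([^0-9A-Za-z\s])")


def extract(text: str) -> list[tuple[int, str]]:
    """Return (value, sign) pairs for every temperature-like match."""
    return [(int(digits), sign) for digits, sign in TEMP.findall(text)]
```

The trade-off is the one the answer hints at: this accepts any degree look-alike without listing it, but it also accepts ordinary punctuation, so "It is 24." yields a match with "." as the sign. Some validation of the captured character is still needed afterwards.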