Mathematica regular expressions on unicode strings
This was a fascinating debugging experience.
Can you spot the difference between the following two lines?
StringReplace["–", RegularExpression@"[\\s\\S]" -> "abc"]
StringReplace["-", RegularExpression@"[\\s\\S]" -> "abc"]
They do very different things when you evaluate them. It turns out it's because the string being replaced in the first line consists of a unicode en dash, as opposed to a plain old ascii dash in the second line.
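One quick way to see the difference between the two strings is to look at their character codes. This is only an illustrative check, not part of the original debugging session:

```mathematica
(* The two dashes look alike on screen but have different code points *)
ToCharacterCode["–"]  (* {8211}: U+2013 EN DASH *)
ToCharacterCode["-"]  (* {45}: ASCII hyphen-minus *)
```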
In the case of the unicode string, the regular expression doesn't match.
I meant the regex "[\s\S]" to mean "match any character (including newline)" but Mathematica apparently treats it as "match any ascii character".
How can I fix the regular expression so the first line above evaluates the same as the second? Alternatively, is there an asciify filter I can apply to the strings first?
PS: The Mathematica documentation says that its string pattern matching is built on top of the Perl-Compatible Regular Expressions library (http://pcre.org) so the problem I'm having may not be specific to Mathematica.
Here's an asciify function which I used as a workaround at first:
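The original function is not shown here. As a rough sketch of what such a filter could look like (the name asciify comes from the post, but this body is an assumption, not the author's code), one could work on character codes, map the en dash to a plain hyphen, and replace anything else outside the ASCII range with "?":

```mathematica
(* Hypothetical asciify: the replacement choices below are assumptions *)
asciify[s_String] := FromCharacterCode[
  ToCharacterCode[s] /. {
    8211 -> 45,                     (* en dash -> ASCII hyphen-minus *)
    (c_Integer /; c > 127) -> 63    (* any other non-ASCII code -> "?" *)
  }]

asciify["–"]  (* "-" *)
```

A real filter would probably want a longer table of substitutions (curly quotes, em dashes, and so on) before falling back to "?".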
Then I realized, thanks to @Isaac's answer, that "." as a regular expression doesn't seem to have this unicode problem. I learned from the answers to "Bug in Mathematica: regular expression applied to very long string" that "(.|\n)" is ill-advised but that "(?s)." is recommended. So I think the best fix is to use "(?s)." in place of "[\s\S]".
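Applied to the two lines from the question, that fix would look roughly like this. Going by the observation above, both calls should now give the same result, though I have not checked this across Mathematica versions:

```mathematica
(* (?s) turns on PCRE's "dot matches newline" mode, so . matches any character *)
StringReplace["–", RegularExpression@"(?s)." -> "abc"]
StringReplace["-", RegularExpression@"(?s)." -> "abc"]
```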
I would use a StringExpression in place of RegularExpression, which works as desired. In a StringExpression, Blank[] will match anything, including non-ASCII characters.

EDIT in response to version updates: as of Mathematica 11.0.1, it looks like for letter characters with character codes up to 2^16 - 1 (which is called out as the maximum value for FromCharacterCode), the results of StringMatchQ[LetterCharacter] now match those of LetterQ.
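A minimal sketch of the StringExpression approach, using the strings from the question (the exact code of the original answer is not preserved here, so treat this as an assumed reconstruction of the idea):

```mathematica
(* _ is Blank[], which as a string pattern matches any single character *)
StringReplace["–", _ -> "abc"]
StringReplace["-", _ -> "abc"]
```

Both calls should treat the en dash and the ASCII dash the same way, since no regular expression is involved.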
Using "(.|\n)" for the input to RegularExpression seems to work for me. The pattern matches . (any non-newline character) or \n (a newline character).
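For illustration, the pattern could be tried on a string that contains an en dash, a newline, and an ASCII dash. The test string is my own assumption, chosen only to exercise both branches of the alternation:

```mathematica
(* Each character, including the newline, should be replaced by "x" *)
StringReplace["–\n-", RegularExpression@"(.|\n)" -> "x"]
```

Note that another answer in this thread reports "(.|\n)" as ill-advised on very long strings and prefers "(?s).", so this form is probably best kept to short inputs.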