Mathematica regular expressions on unicode strings

Posted 2024-08-26 10:10:39

This was a fascinating debugging experience.
Can you spot the difference between the following two lines?

StringReplace["–", RegularExpression@"[\\s\\S]" -> "abc"]
StringReplace["-", RegularExpression@"[\\s\\S]" -> "abc"]

They do very different things when you evaluate them. It turns out it's because the string being replaced in the first line consists of a Unicode en dash, as opposed to a plain old ASCII hyphen in the second line.
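One way to tell the two look-alike characters apart is to inspect their character codes. This is only a small sketch; the variable names `dashes` and `codes` are illustrative and not part of the original post.

```mathematica
(* The two strings from the question; the first holds an en dash (U+2013) *)
dashes = {"–", "-"};

(* ToCharacterCode returns the integer code of each character in a string *)
codes = ToCharacterCode /@ dashes
(* {{8211}, {45}} *)
```

Any code above 127 marks a character outside the ASCII range.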

In the case of the Unicode string, the regular expression doesn't match.
I meant the regex "[\s\S]" to mean "match any character (including newline)", but Mathematica apparently treats it as "match any ASCII character".

How can I fix the regular expression so the first line above evaluates the same as the second? Alternatively, is there an asciify filter I can apply to the strings first?

PS: The Mathematica documentation says that its string pattern matching is built on top of the Perl-Compatible Regular Expressions library (http://pcre.org) so the problem I'm having may not be specific to Mathematica.

Comments (3)

蓝眸 2024-09-02 10:10:39

Here's an asciify function which I used as a workaround at first:

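(* Helper: strings pass through unchanged; integer codes become characters *)
f[s_String] := s
f[x_] := FromCharacterCode[x]

(* Replace each character whose code exceeds 255 with an "&code;" entity;
   codes 128 to 255 are left as they are *)
asciify[s_String] := 
  StringJoin[f /@ (ToCharacterCode[s] /. x_?(#>255&) :> "&"<>ToString[x]<>";")]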

Then I realized, thanks to @Isaac's answer, that "." as a regular expression doesn't seem to have this unicode problem. I learned from the answers to the question "Bug in Mathematica: regular expression applied to very long string" that "(.|\n)" is ill-advised but that "(?s)." is recommended. So I think the best fix is the following:

StringReplace["–", RegularExpression@"(?s)." -> "abc"]
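As a quick check of this fix, the same pattern can be mapped over a few one-character strings. This is a sketch only: the names `samples` and `results` and the choice of inputs are assumptions made for illustration.

```mathematica
(* Hypothetical sample inputs: ASCII hyphen, en dash (U+2013), and a newline *)
samples = {"-", FromCharacterCode[8211], "\n"};

(* (?s) switches on PCRE's dotall mode, so . matches a newline as well *)
results = StringReplace[#, RegularExpression["(?s)."] -> "abc"] & /@ samples
```

If the pattern behaves as described above, every entry of `results` should come back as "abc".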

∞梦里开花 2024-09-02 10:10:39

I would use a StringExpression in place of RegularExpression. This works as desired:

f[s_String] := StringReplace[s, _ -> "abc"]

In a StringExpression, Blank[] will match anything, including non-ASCII characters.
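A small sketch of this behaviour, assuming the same replacement rule as above. The name `toAbc` is hypothetical; it is used instead of `f` only so it does not collide with the `f` helper defined in the other answer.

```mathematica
(* Same rule as in this answer, under a hypothetical name *)
toAbc[s_String] := StringReplace[s, _ -> "abc"]

(* An en dash (U+2013) and a string with a newline in the middle *)
out = {toAbc[FromCharacterCode[8211]], toAbc["a\nb"]}
(* {"abc", "abcabcabc"} *)
```

In a string pattern, `_` stands for any single character, so each character, including the newline, is replaced once.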

EDIT in response to version updates: as of Mathematica 11.0.1, it looks like, for characters with character codes up to 2^16 - 1 (which is called out as the maximum value for FromCharacterCode), the results of StringMatchQ[#, LetterCharacter] now match those of LetterQ.

AllTrue[FromCharacterCode /@ Range[2^16 - 1], 
 LetterQ@# === StringMatchQ[#, LetterCharacter] &]
(* True *)

尴尬癌患者 2024-09-02 10:10:39

Using "(.|\n)" for the input to RegularExpression seems to work for me. The pattern matches . (any non-newline character) or \n (a newline character).
