Mathematica regular expressions on Unicode strings

Posted 2024-08-26 10:10:39

This was a fascinating debugging experience.
Can you spot the difference between the following two lines?

StringReplace["–", RegularExpression@"[\\s\\S]" -> "abc"]
StringReplace["-", RegularExpression@"[\\s\\S]" -> "abc"]

They do very different things when you evaluate them. It turns out that's because the string being replaced in the first line consists of a Unicode en dash, as opposed to a plain old ASCII dash in the second line.
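
Since the two strings look almost identical on screen, one quick way to tell them apart is to look at their character codes. A minimal check using only the built-in ToCharacterCode:

```mathematica
(* The first string is an en dash, the second an ASCII hyphen *)
ToCharacterCode["–"]  (* {8211} *)
ToCharacterCode["-"]  (* {45} *)
```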

In the case of the Unicode string, the regular expression doesn't match.
I meant the regex "[\s\S]" to mean "match any character (including newline)", but Mathematica apparently treats it as "match any ASCII character".

How can I fix the regular expression so the first line above evaluates the same as the second? Alternatively, is there an asciify filter I can apply to the strings first?

PS: The Mathematica documentation says that its string pattern matching is built on top of the Perl-Compatible Regular Expressions library (http://pcre.org) so the problem I'm having may not be specific to Mathematica.

蓝眸 2024-09-02 10:10:39

Here's an asciify function which I used as a workaround at first:

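(* Strings (the already-escaped pieces) pass through unchanged *)
f[s_String] := s
(* Anything else is a character code; turn it back into a character *)
f[x_] := FromCharacterCode[x]

(* Replace every character whose code is above 255 with "&code;" *)
asciify[s_String] := 
  StringJoin[f /@ (ToCharacterCode[s] /. x_?(#>255&) :> "&"<>ToString[x]<>";")]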

Then I realized, thanks to @Isaac's answer, that "." as a regular expression doesn't seem to have this Unicode problem. I learned from the answers to "Bug in Mathematica: regular expression applied to very long string" that "(.|\n)" is ill-advised but that "(?s)." is recommended. So I think the best fix is the following:

StringReplace["–", RegularExpression@"(?s)." -> "abc"]
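
As a rough sanity check of the "(?s)." pattern, here is a sketch on a made-up three-character test string (en dash, newline, ASCII dash). The test string is an assumption for illustration and not part of the original question:

```mathematica
(* (?s) is the PCRE "dot matches newline" flag, so "." should cover every character *)
test = "–\n-";
result = StringReplace[test, RegularExpression["(?s)."] -> "abc"]

(* If all three characters were matched, the result is three copies of "abc" *)
result === "abcabcabc"
```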

∞梦里开花 2024-09-02 10:10:39

I would use a StringExpression in place of RegularExpression. This works as desired:

f[s_String] := StringReplace[s, _ -> "abc"]

In a StringExpression, Blank[] will match anything, including non-ASCII characters.
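
For instance, applying the same Blank[] pattern directly to the two strings from the question would look something like this (a sketch only, with the function body from above inlined):

```mathematica
(* _ (Blank[]) as a string pattern stands for any single character *)
StringReplace["–", _ -> "abc"]  (* en dash from the question *)
StringReplace["-", _ -> "abc"]  (* ASCII dash from the question *)

(* Per the claim above, both calls should return the same string *)
StringReplace["–", _ -> "abc"] === StringReplace["-", _ -> "abc"]
```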

EDIT in response to version updates: as of Mathematica 11.0.1, it looks like, for characters with character codes up to 2^16 - 1 (which is called out as the maximum value for FromCharacterCode), the results of StringMatchQ[#, LetterCharacter] now match those of LetterQ.

AllTrue[FromCharacterCode /@ Range[2^16 - 1], 
 LetterQ@# === StringMatchQ[#, LetterCharacter] &]
(* True *)

尴尬癌患者 2024-09-02 10:10:39

Using "(.|\n)" as the input to RegularExpression seems to work for me. The pattern matches either . (any non-newline character) or \n (a newline).
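
Applied to the string from the question, that suggestion would look roughly like the following. This is a sketch, not a tested recommendation; note that another answer in this thread reports "(.|\n)" as ill-advised for very long strings.

```mathematica
(* Inside a Mathematica string, "\n" is a literal newline character, *)
(* so the alternation reads: any non-newline character, or a newline *)
StringReplace["–", RegularExpression["(.|\n)"] -> "abc"]
```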
