Bug in Mathematica: regular expression applied to a very long string

Posted 2024-08-21 14:23:15

In the following code, if the string s is extended to something like 10 or 20 thousand characters, the Mathematica kernel segfaults.

s = "This is the first line.
MAGIC_STRING
Everything after this line should get removed.
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
...";

s = StringReplace[s, RegularExpression@"(^|\\n)[^\\n]*MAGIC_STRING(.|\\n)*"->""]

I think this is primarily Mathematica's fault and I've submitted a bug report and will follow up here if I get a response. But I'm also wondering if I'm doing this in a stupid/inefficient way. And even if not, ideas for working around Mathematica's bug would be appreciated.

Comments (3)

2024-08-28 14:23:15

Mathematica uses PCRE syntax, so it does have the /s (aka DOTALL, aka Singleline) modifier: you just put (?s) before the part of the expression to which you want it to apply.

See the RegularExpression documentation here: (expand the section labeled "More Information")
http://reference.wolfram.com/mathematica/ref/RegularExpression.html

The following set options for all regular expression elements that follow them:
(?i) treat uppercase and lowercase as equivalent (ignore case)
(?m) make ^ and $ match start and end of lines (multiline mode)
(?s) allow . to match newline
(?-c) unset options

This modified input doesn't crash Mathematica 7.0.1 for me (the original did), using a string that is 15,000 characters long, producing the same output as your expression:

s = StringReplace[s,RegularExpression@".*MAGIC_STRING(?s).*"->""]

It should also be a bit faster, for the reasons @AlanMoore explained.
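
A minimal sketch for reproducing this, assuming a filler string built with Table (the 2,000 repetitions and the name filler are illustrative, not from the original post). As far as I can tell, because . does not cross newlines before (?s), this pattern leaves the newline that precedes the MAGIC_STRING line in place, whereas the (^|\\n) version also consumes it. Viewing the result in InputForm makes that visible.

```mathematica
(* Hypothetical test input: the three lines from the question plus 20,000 filler characters *)
filler = StringJoin[Table["1234567890", {2000}]];
s = "This is the first line.\nMAGIC_STRING\nEverything after this line should get removed.\n" <> filler;

(* Before (?s), "." stops at newlines, so the leading .* stays on the MAGIC_STRING line.
   After (?s), "." also matches newlines, so the trailing .* runs to the end of the string. *)
result = StringReplace[s, RegularExpression[".*MAGIC_STRING(?s).*"] -> ""];

(* Show the result with escapes visible, since a trailing newline is easy to miss *)
InputForm[result]
```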

执手闯天涯 2024-08-28 14:23:15

The best way to optimize the regex depends on the internals of Mathematica's regex engine, but I would definitely get rid of the (.|\\n)*, as @Simon mentioned. It's not just the alternation--although it's almost always a mistake to have an alternation in which every alternative matches exactly one character; that's what character classes are for. But you're also capturing each character when you match it (because of the parentheses), only to throw it away when you match the next character.

A quick scan of the Mathematica regex docs doesn't turn up anything like the /s (Singleline or DOTALL) modifier, so I recommend the old JavaScript standby, [\\s\\S]* -- match anything that is whitespace or anything that isn't whitespace. Also, it might help to add the $ anchor to the end of the regex:

"(^|\\n)[^\\n]*MAGIC_STRING[\\s\\S]*$"

But your best option would probably be not to use regexes at all. I don't see anything here that requires them, and it would probably be much easier as well as more efficient to use Mathematica's normal string-manipulation functions.
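
As one possible regex-free route, here is a minimal sketch that splits the string into lines and keeps only those before the first line containing the marker. The function name dropFromMarker and its default argument are hypothetical, and I haven't timed it against the regex versions.

```mathematica
(* Hypothetical helper: keep only the lines that come before the first line containing marker *)
dropFromMarker[str_String, marker_String : "MAGIC_STRING"] :=
  Module[{lines, kept},
    lines = StringSplit[str, "\n"];  (* note: drops empty lines at the very start or end *)
    kept = TakeWhile[lines, StringFreeQ[#, marker] &];
    StringJoin[Riffle[kept, "\n"]]
  ]

(* s is the string from the question *)
dropFromMarker[s]
```

For the sample string in the question this should leave just the first line, with the newline before the marker line removed as well. If the marker never appears, all lines are kept.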

不念旧人 2024-08-28 14:23:15

Mathematica is a great executive toy, but I'd advise against trying to do anything serious with it, like regexes over long strings or any kind of computation over significant amounts of data (or where correctness is important). Use something tried and tested. Visual F# 2010 takes 5 milliseconds and one line of code to get the correct answer without crashing:

> let str =
    "This is the first line.\nMAGIC_STRING\nEverything after this line should get removed." +
      String.replicate 2000 "0123456789";;
val str : string =
  "This is the first line.
MAGIC_STRING
Everything after this li"+[20022 chars]

> open System.Text.RegularExpressions;;
> #time;;
--> Timing now on

> (Regex "(^|\\n)[^\\n]*MAGIC_STRING(.|\\n)*").Replace(str, "");;
Real: 00:00:00.005, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0
val it : string = "This is the first line."