Regex replacing &#58; with ":" etc.
I've got a bunch of strings like:
"Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
I would like to replace that with
"Hello, here's a test colon:. Here's a test semi-colon;"
And so on for all printable ASCII values.
At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the unmatched substring since the last match I found).
Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.
Thanks,
Dom
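For concreteness, the loop described above could look roughly like the sketch below. It uses std::regex from C++11 in place of boost::regex so that it builds without Boost. The function name decode_ncrs and the decision to leave anything outside 32..126 untouched are assumptions made for illustration, not the original code.

```cpp
#include <regex>
#include <string>

// Sketch only: std::regex stands in for boost::regex, and decode_ncrs is an
// illustrative name. Decodes "&#NN;" when NN is printable ASCII (32..126).
std::string decode_ncrs(const std::string& in) {
    static const std::regex ncr("&#(\\d+);");
    std::string out;
    auto pos = in.cbegin();
    std::smatch m;
    while (std::regex_search(pos, in.cend(), m, ncr)) {
        out.append(pos, m[0].first);              // text since the previous match
        const std::string digits = m[1].str();
        // More than three digits can never be ASCII; skip std::stoi for those.
        const int code = digits.size() <= 3 ? std::stoi(digits) : -1;
        if (code >= 32 && code <= 126)
            out.push_back(static_cast<char>(code));
        else
            out.append(m[0].first, m[0].second);  // keep the reference verbatim
        pos = m[0].second;                        // resume after this match
    }
    out.append(pos, in.cend());                   // text after the last match
    return out;
}
```

Calling decode_ncrs on the first string above should give the second one; references such as &#200; pass through unchanged.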
Comments (12)
Here's an NCR scanner created using Flex:
To make an executable:
Example:
For the non-recursive version it prints:
And the recursive version produces:
This seems to be one of those cases where the original problem statement isn't very complete, but if you really only want to trigger on references that produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in range and others are not).
It would not be particularly difficult to handle that case as well (e.g. ;#131;#58; produces ";#131;:"):
Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin with the corresponding ASCII characters and prints them to stdout.
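A rough sketch of the token-iterator idea follows. It is not the answer's original program: std::sregex_token_iterator stands in for boost::regex_token_iterator, translate_ncrs is an illustrative name, and restricting the replacement to 32..126 is an assumption taken from the question.

```cpp
#include <iostream>
#include <iterator>
#include <regex>
#include <sstream>   // only needed for the in-memory usage described below
#include <string>

// Copy `in` to `out`, decoding decimal NCRs for printable ASCII.
// Sketch: std::sregex_token_iterator replaces boost::regex_token_iterator.
void translate_ncrs(std::istream& in, std::ostream& out) {
    static const std::regex ncr("&#(\\d{1,3});");
    const std::string text((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    // Submatch -1 is the text before a match, submatch 1 is the digit group,
    // so tokens alternate: text, digits, text, digits, ..., trailing text.
    const std::sregex_token_iterator end;
    std::sregex_token_iterator tok(text.begin(), text.end(), ncr, {-1, 1});
    while (tok != end) {
        out << tok->str();                  // text outside any reference
        if (++tok == end) break;            // that was the trailing text
        const std::string digits = tok->str();
        const int code = std::stoi(digits);
        if (code >= 32 && code <= 126)
            out << static_cast<char>(code);
        else
            out << "&#" << digits << ';';   // not printable ASCII: re-emit as-is
        ++tok;
    }
}
```

For the stdin/stdout behaviour described above, a main that calls translate_ncrs(std::cin, std::cout) would do; passing a std::istringstream and a std::ostringstream instead makes it easy to try on a literal string.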
The big advantage of using a regex is that it deals correctly with tricky cases like &#38;#38;: entity replacement isn't iterative, it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.
I'd say a regex was the right choice.
Is it the best regex, though? You know you need two digits, and if you have three digits, the first one will be a 1. Printable ASCII is, after all, &#32; to &#126;. For that reason, you could consider &#1?\d\d;.
As for replacing the content, I'd use the basic algorithm described for boost::regex::replace:
This will probably earn me some downvotes, seeing as this is not a C++, Boost or regex response, but here's a SNOBOL solution. This one works for ASCII. I'm working on something for Unicode.
I don't know about the regex support in Boost, but check whether it has a replace() method that supports callbacks or lambdas or some such. I'd say that's the usual way to do this with regexes in other languages.
Here's a Python implementation:
Producing:
I've now looked at Boost a bit and I see it has a regex_replace function. But C++ really confuses me, so I can't figure out whether you could use a callback for the replace part. The string matched by the (\d\d) group should be available in $1, though, if I read the Boost docs correctly. I'd check it out if I were using Boost.
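To show what the callback style might look like in C++, here is a minimal sketch. replace_with is a hypothetical helper written for this example, not a Boost or standard library function, and it is built on std::regex rather than boost::regex. The 32..126 check is an assumption taken from the question.

```cpp
#include <functional>
#include <regex>
#include <string>

// Hypothetical helper: replace every match of `re` in `in` with the string
// that `fn` returns for that match.
std::string replace_with(const std::string& in, const std::regex& re,
                         const std::function<std::string(const std::smatch&)>& fn) {
    std::string out;
    auto last = in.cbegin();
    for (std::sregex_iterator it(in.cbegin(), in.cend(), re), end; it != end; ++it) {
        out.append(last, (*it)[0].first);   // unmatched text before this match
        out += fn(*it);                     // whatever the callback decides
        last = (*it)[0].second;
    }
    out.append(last, in.cend());            // unmatched text after the last match
    return out;
}

// The callback decodes printable ASCII and hands anything else back untouched.
std::string decode_ncrs_cb(const std::string& in) {
    static const std::regex ncr("&#(\\d{1,3});");
    return replace_with(in, ncr, [](const std::smatch& m) {
        const int code = std::stoi(m[1].str());
        return (code >= 32 && code <= 126) ? std::string(1, static_cast<char>(code))
                                           : m.str();
    });
}
```

The helper keeps the matching loop in one place, so the per-match logic stays a short lambda, much like passing a function to a replace call in a scripting language.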
The existing SNOBOL solutions don't handle the multiple-patterns case properly, because there is only one "&". The following solution ought to work better:
Ya know, as long as we're off topic here, Perl substitution has an 'e' option, as in "evaluate expression". E.g.
Pretty-printing that:
Though Perl being Perl, I'm sure there's a much better way to write that...
Back to C code:
You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.
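For comparison, a hand-rolled state machine might look something like the sketch below (written in C++ rather than C). The state names, the function name decode_ncrs_fsm and the 32..126 cut-off are assumptions for illustration. It also shows why this route gets fiddly: every way a candidate reference can fail has to be handled by hand.

```cpp
#include <string>

// Sketch of a regex-free scanner for "&#NN;" references.
// States: Text (plain text), Amp (just saw '&'), Digits (saw "&#" plus 0-3 digits).
std::string decode_ncrs_fsm(const std::string& in) {
    enum State { Text, Amp, Digits } state = Text;
    std::string out, pending;   // pending holds the candidate reference so far
    int code = 0;
    for (char c : in) {
        if (state == Amp && c == '#') {
            pending += c; state = Digits; code = 0;
            continue;
        }
        if (state == Digits && c >= '0' && c <= '9' && pending.size() < 5) {
            pending += c; code = code * 10 + (c - '0');   // at most three digits
            continue;
        }
        if (state == Digits && c == ';' && pending.size() > 2 &&
            code >= 32 && code <= 126) {
            out += static_cast<char>(code);               // complete, printable
            pending.clear(); state = Text;
            continue;
        }
        // Candidate failed (or we were in plain text): flush it, re-examine c.
        out += pending; pending.clear();
        if (c == '&') { pending = "&"; state = Amp; }
        else          { out += c;      state = Text; }
    }
    out += pending;   // unfinished candidate at end of input
    return out;
}
```

Because the decoded character goes straight to the output and is never rescanned, the replacement is a single pass.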
Here's another Perl one-liner (see @mrree's answer):
The boost::spirit parser generator framework makes it easy to create a parser that transforms the desired NCRs.
I did think I was pretty good at regex, but I have never seen lambdas used in a regex. Please enlighten me!
I'm currently using Python and would have solved it with this one-liner:
Does that make any sense?