Regex replacing &#58; to ":" etc.

Posted 2024-07-12 01:10:50

I've got a bunch of strings like:

"Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"

I would like to replace that with

"Hello, here's a test colon:. Here's a test semi-colon;"

And so on for all printable ASCII values.

At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).

Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.

Thanks,

Dom

猫瑾少女 2024-07-19 01:10:51

Here's a scanner for numeric character references (NCRs), created using Flex:

/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
  /**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
  fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}

To make an executable:

$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl

Example:

$ echo "Hello,  here's a test colon&#58;.
> Here's a test semi-colon&#59; '&#131;'
> &#38;#59; <-- may be recursive" \
> | ncr2a

The non-recursive version prints:

Hello,  here's a test colon:.
Here's a test semi-colon; '&#131;'
&#59; <-- may be recursive

And the recursive one produces:

Hello,  here's a test colon:.
Here's a test semi-colon; '&#131;'
; <-- may be recursive
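
For readers without Flex at hand, the difference between the two variants can be sketched in standard C++. This is only an illustration under assumptions: the function names `decode_once` and `decode_recursive` are made up here, `std::regex` stands in for the Flex rule, and repeating the pass until nothing changes only approximates what `unput()` does inside the scanner.

```cpp
#include <regex>
#include <string>

// One left-to-right pass: decode decimal NCRs in the range 32..126.
// Characters produced by a replacement are never scanned again.
std::string decode_once(const std::string& in) {
  static const std::regex ncr("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);");
  std::string out;
  auto last = in.cbegin();
  for (std::sregex_iterator it(in.cbegin(), in.cend(), ncr), end; it != end; ++it) {
    out.append(last, (*it)[0].first);  // text between the previous match and this one
    out.push_back(static_cast<char>(std::stoi((*it)[1].str())));
    last = (*it)[0].second;
  }
  out.append(last, in.cend());  // text after the last match
  return out;
}

// "Recursive" behaviour, approximated by decoding until a fixed point is reached.
// Every successful pass shortens the string, so the loop terminates.
std::string decode_recursive(std::string s) {
  for (std::string next = decode_once(s); next != s; next = decode_once(s)) {
    s = next;
  }
  return s;
}
```

With this sketch, `decode_once("&#38;#59;")` yields `&#59;`, while `decode_recursive("&#38;#59;")` goes one step further and yields `;`.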
乜一 2024-07-19 01:10:51

This is one of those cases where the original problem statement apparently isn't very complete, but if you really only want to trigger on references that produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in range and others are not).

      dd = "0123456789"
      ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
 +      fence (*ccp | null)
 rdl  line = input                              :f(done)
 repl line "&" *?(s = ) ccp = s                 :s(repl)
      output = line                             :(rdl)
 done
 end

It would not be particularly difficult to handle that case as well (e.g. &#131;&#58; produces "&#131;:"):

      dd = "0123456789"
      ccp = "#" (span(dd) $ n ";") $ enc
 +      *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
 +      fence (*ccp | null)
 rdl  line = input                              :f(done)
 repl line "&" *?(s = ) ccp = s                 :s(repl)
      output = replace(line,char(10),"#")       :(rdl)
 done
 end
兮子 2024-07-19 01:10:51

Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin by corresponding ASCII characters and prints them to stdout.

#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>

int main()
{
  boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
  const int subs[] = {-1, 1}; // non-match & subexpr
  boost::sregex_token_iterator end;
  std::string line;

  while (std::getline(std::cin, line)) {
    boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);

    for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
      if (isncr) { // convert NCR e.g., '&#58;' -> ':'
        const int d = boost::lexical_cast<int>(*tok);
        assert(32 <= d && d < 127);
        std::cout << static_cast<char>(d);
      }
      else
        std::cout << *tok; // output as is
    }
    std::cout << '\n';
  }
}
行雁书 2024-07-19 01:10:50

The big advantage of using a regex is that it deals with tricky cases like &#38;#38;. Entity replacement isn't iterative, it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.

I'd say a regex was the right choice.

Is it the best regex, though? You know you need two digits, and if you have three digits, the first one will be a 1. Printable ASCII is, after all, &#32; to &#126;. For that reason, you could consider &#1?\d\d;.

As for replacing the content, I'd use the basic algorithm described for boost::regex_replace:

For each match // Using regex_iterator<>
    Print the prefix of the match
    Remove the first 2 and last character of the match (&#;)
    lexical_cast the result to int, then truncate to char and append.

Print the suffix of the last match.
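
The steps above could look roughly like this in C++. It is a sketch under assumptions rather than the answerer's code: it uses `std::regex` from the standard library instead of Boost, `std::stoi` instead of `lexical_cast`, the name `decode_printable` is invented for illustration, and it adds a range check because `&#1?\d\d;` also matches values outside 32..126.

```cpp
#include <regex>
#include <string>

// Replace decimal NCRs for printable ASCII, following the prefix/match/suffix steps.
std::string decode_printable(const std::string& in) {
  static const std::regex ncr("&#(1?[0-9][0-9]);");
  std::sregex_iterator it(in.begin(), in.end(), ncr), end;
  if (it == end) return in;  // no match at all: nothing to do

  std::string out;
  std::smatch last;
  for (; it != end; ++it) {
    last = *it;
    out += last.prefix().str();  // text since the previous match
    const int code = std::stoi(last[1].str());
    if (32 <= code && code <= 126) {
      out += static_cast<char>(code);  // printable ASCII: emit the character
    } else {
      out += last[0].str();  // out of range: keep the reference unchanged
    }
  }
  out += last.suffix().str();  // text after the last match
  return out;
}
```

Because the input is scanned exactly once, `decode_printable("&#38;#38;")` returns `&#38;` and does not go on to decode the result again.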
画离情绘悲伤 2024-07-19 01:10:50

This will probably earn me some downvotes, seeing as this is not a C++, Boost or regex answer, but here's a SNOBOL solution. This one works for ASCII. I'm working on something for Unicode.

        NUMS = '1234567890'
MAIN    LINE = INPUT                                :F(END)
SWAP    LINE ?  '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
        OUTPUT = LINE                               :(MAIN)
END
作妖 2024-07-19 01:10:50
* Repaired SNOBOL4 Solution
* &#38;#38; -> &#38;
     digit = '0123456789'
main line = input                        :f(end)
     result = 
swap line arb . l
+    '&#' span(digit) . n ';' rem . line :f(out)
     result = result l char(n)           :(swap)
out  output = result line                :(main)
end
疾风者 2024-07-19 01:10:50

I don't know about the regex support in boost, but check if it has a replace() method that supports callbacks or lambdas or some such. That's the usual way to do this with regexes in other languages I'd say.

Here's a Python implementation:

import re

s = "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s)

Producing:

"Hello, here's a test colon:. Here's a test semi-colon;"

I've now looked at Boost a little and I see it has a regex_replace function. But C++ really confuses me, so I can't figure out whether you could use a callback for the replace part. The string matched by the (\d\d) group should be available in $1, though, if I read the Boost docs correctly. I'd check it out if I were using Boost.
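
For comparison, a callback-style replace can be written by hand in C++ on top of `std::sregex_iterator`. The helper `replace_with` below is hypothetical (it is not a Boost or standard library function), and `decode_ncrs` simply mirrors the Python lambda, so like the Python version it performs no range check beyond what the pattern enforces.

```cpp
#include <functional>
#include <regex>
#include <string>

// Hypothetical helper: replace each match of `re` in `in` with the string returned by `fn`.
std::string replace_with(const std::string& in, const std::regex& re,
                         const std::function<std::string(const std::smatch&)>& fn) {
  std::string out;
  auto last = in.cbegin();
  for (std::sregex_iterator it(in.cbegin(), in.cend(), re), end; it != end; ++it) {
    out.append(last, (*it)[0].first);  // unmatched text before this match
    out += fn(*it);                    // replacement computed by the callback
    last = (*it)[0].second;
  }
  out.append(last, in.cend());  // unmatched tail
  return out;
}

// Counterpart of the Python lambda: turn the captured digits into one character.
std::string decode_ncrs(const std::string& in) {
  static const std::regex ncr("&#(1?[0-9][0-9]);");
  return replace_with(in, ncr, [](const std::smatch& m) {
    return std::string(1, static_cast<char>(std::stoi(m[1].str())));
  });
}
```

Calling `decode_ncrs` on the question's example string gives back the sentence with `:` and `;` in place of the two references.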

雪落纷纷 2024-07-19 01:10:50

The existing SNOBOL solutions don't handle the multiple-patterns case properly, due to there only being one "&". The following solution ought to work better:

        dd = "0123456789"
        ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
   rdl  line = input                              :f(done)
   repl line "&" *?(s = ) ccp = s                 :s(repl)
        output = line                             :(rdl)
   done
   end
半衬遮猫 2024-07-19 01:10:50

Ya know, as long as we're off topic here, perl substitution has an 'e' option. As in evaluate expression. E.g.

echo "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;
Further test &#38;#65;. abc.&#126;.def." |
perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) )
{ return sprintf("%c",$x); } else { return "&#".$x.";"; } }
while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'

Pretty-printing that:

#!/usr/bin/perl -w

sub translate
{
  my $x=$_[0];

  if ( ($x >= 32) && ($x <= 126) )
  {
    return sprintf( "%c", $x );
  }
  else
  {
    return "&#" . $x . ";" ;
  }
}

while (<>)
{
  s/&#(1?\d\d);/&translate($1)/ge;
  print;
}

Though perl being perl, I'm sure there's a much better way to write that...


Back to C code:

You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.
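
To give an idea of what such a hand-rolled state machine involves, here is a minimal sketch. The function name `decode_fsm` and the three states are illustrative choices, not taken from the answer. It accepts one to three digits (leading zeros included, unlike the regexes in this thread) and leaves anything that is not a complete in-range reference untouched.

```cpp
#include <string>

// Decode "&#ddd;" for codes 32..126 with an explicit three-state scanner.
std::string decode_fsm(const std::string& in) {
  enum class State { Text, Amp, Digits };
  State state = State::Text;
  std::string out;
  std::string pending;  // characters of a possibly incomplete reference, e.g. "&#5"
  int code = 0;

  for (char c : in) {
    switch (state) {
      case State::Text:
        if (c == '&') { pending = "&"; state = State::Amp; }
        else out += c;
        break;
      case State::Amp:  // seen "&"
        if (c == '#') { pending += c; code = 0; state = State::Digits; }
        else if (c == '&') { out += pending; pending = "&"; }  // restart at the new '&'
        else { out += pending; out += c; state = State::Text; }
        break;
      case State::Digits:  // seen "&#" followed by zero to three digits
        if (c >= '0' && c <= '9' && pending.size() < 5) {
          pending += c;
          code = code * 10 + (c - '0');
        } else if (c == ';' && pending.size() > 2 && code >= 32 && code <= 126) {
          out += static_cast<char>(code);  // complete, in-range reference
          state = State::Text;
        } else if (c == '&') {
          out += pending; pending = "&"; state = State::Amp;
        } else {
          out += pending; out += c; state = State::Text;  // not a valid reference
        }
        break;
    }
  }
  if (state != State::Text) out += pending;  // flush an unterminated reference
  return out;
}
```

Even in this small form, the bookkeeping for partial matches is noticeably more code than the regex versions, which is the maintenance cost the answer warns about.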

一身骄傲 2024-07-19 01:10:50

Here's another Perl one-liner (see @mrree's answer):

  • a test file:
$ cat ent.txt 
Hello,  here's a test colon&#58;. 
Here's a test semi-colon&#59; '&#131;'
  • the one-liner:
$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); 
> $& }->()~eg' ent.txt
  • or using a more specific regex:
$ perl -pe's~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg' ent.txt
  • both one-liners produce the same output:
Hello,  here's a test colon:. 
Here's a test semi-colon; '&#131;'
风透绣罗衣 2024-07-19 01:10:50

The boost::spirit parser generator framework makes it easy to create a parser that transforms the desired NCRs.

// spirit_ncr2a.cpp
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>

int main() {
  using namespace BOOST_SPIRIT_CLASSIC_NS; 

  std::string line;
  while (std::getline(std::cin, line)) {
    // Parse outside assert() so the output still happens when NDEBUG is defined.
    const bool full = parse(line.begin(), line.end(),
         // match "&#(\d+);" where 32 <= $1 <= 126 or any char
         *(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
           | anychar_p[&putchar])).full;
    assert(full);
    (void)full; // avoid an unused-variable warning in NDEBUG builds
    putchar('\n');
  }
}
  • compile:
    $ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp
  • run:
    $ echo "Hello,  here's a test colon&#58;." | spirit_ncr2a
  • output:
    Hello,  here's a test colon:.
神妖 2024-07-19 01:10:50

I thought I was pretty good at regex, but I have never seen lambdas being used with a regex. Please enlighten me!

I'm currently using Python and would have solved it with this one-liner:

''.join([x.isdigit() and chr(int(x)) or x for x in re.split(r'&#(\d+);', THESTRING)])

Does that make any sense?
