Regex - remove space characters when several occur in a row, but exclude all commented-out lines

Posted on 2025-01-10 14:08:53

Let's say I have a few lines as follows:

01090   C   -------CALCULATION OF SOMETHING--
01100   "SOME.VARIABLE"   =  "SOME.OTHER.VARIABLE" + 2
01110   IF("SOME.VARIABLE" .NE.  "SOME.VALUE")  THEN   ON("SOME.MACHINE")

I would like to go through the program and remove all of the space characters that have more than one in succession. For example, line 01100 has three (3) space characters before the "=" and two (2) after. In line 01110, there are several different locations with more than 1 consecutive space char.
I would like to replace them with just a single space char. I do NOT want to remove/alter the spaces that are contained within the commented line 01090.

All lines begin with 5 digits, all lines have a tab following the line number, and only commented lines have a "C" or a "c" that denotes them as commented out.

I am using Sublime Text 3, which uses the Boost regex engine. I have tried things like:

(?!\t[Cc] )[ ]{2,}
(?!\t[Cc])[ ]{2,}

I can't seem to determine how to negate an entire line without also capturing an entire line.

I tried putting a caret in the beginning as well, but that didn't seem to help.

Basically, if the line has a "TAB" followed by a "c" or a "C", then ignore the entire thing. Otherwise, any two or more consecutive space chars are located and replaced with a single space char.
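For readers who want to apply the same rule outside the editor, here is a minimal sketch in C++ using std::regex. It assumes the program is processed one line at a time (without the trailing newline); the helper name collapse_spaces is hypothetical and not part of the original question.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: collapse runs of 2+ spaces in a single program line,
// returning comment lines (digits, a tab, then C or c) unchanged.
std::string collapse_spaces(const std::string& line) {
    static const std::regex comment_line(R"(^\d+\t[cC])");
    static const std::regex space_run(" {2,}");

    // The line is a comment: leave every space exactly where it is.
    if (std::regex_search(line, comment_line)) {
        return line;
    }
    // Otherwise replace each run of two or more spaces with one space.
    return std::regex_replace(line, space_run, " ");
}
```

Deciding per line whether it is a comment sidesteps the need to "negate an entire line" inside a single pattern.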

EDIT

--------- solution ---------

Thanks to the input from Wiktor and The fourth bird, I was able to determine the solution. Many thanks to both. Here's what I ended up with:

^\d+\t[cC].*\K|[ ]{2,}

I also determined that should there be extra spaces at the end of a line, I might want to ignore those as well so I can remove them completely with a different regex search. The final product looks like this:

^\d+\t[cC].*\K|[ ]*\n\K|[ ]{2,}
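The intent of the three alternatives can be sketched without a regex engine. The following C++ is an illustration only, under the assumption that each line is passed in without its newline, so "end of string" stands in for the `\n` in the pattern; the function name collapse_inner_spaces is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper mirroring the intent of the three alternatives:
// 1) comment lines are returned unchanged,
// 2) a run of spaces that reaches the end of the line is kept as-is,
// 3) any other run of spaces is written out as a single space.
std::string collapse_inner_spaces(const std::string& line) {
    // Detect "digits, tab, C or c" at the start of the line.
    std::size_t i = 0;
    while (i < line.size() && std::isdigit(static_cast<unsigned char>(line[i]))) ++i;
    if (i > 0 && i + 1 < line.size() && line[i] == '\t' &&
        (line[i + 1] == 'C' || line[i + 1] == 'c')) {
        return line;
    }

    std::string out;
    out.reserve(line.size());
    for (std::size_t pos = 0; pos < line.size();) {
        if (line[pos] != ' ') {
            out += line[pos++];
            continue;
        }
        // Measure the run of spaces starting at pos.
        std::size_t end = pos;
        while (end < line.size() && line[end] == ' ') ++end;
        if (end == line.size()) {
            out.append(line, pos, end - pos);  // trailing run: keep untouched
        } else {
            out += ' ';                        // interior run: one space
        }
        pos = end;
    }
    return out;
}
```

Trailing runs are left alone here so that a separate pass can strip them, as described above.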

If I had not been limited by the Boost or PCRE engine, I believe one of my previous failed attempts would actually have worked. I'll include it here in case it helps someone else. It can't be used in Boost or PCRE because they don't support unbounded (variable-length) lookbehinds:

(?<!\t[cC].*)[ ]{2,}

够运 2025-01-17 14:08:54

You have to add a negative lookahead and a negative lookbehind to your regex. Try something like this.

(?<![Cc])\s{3,}(?![Cc])


你没皮卡萌 2025-01-17 14:08:54

I think you might actually be preparing to parse this language. Regular expressions are rarely a convenient tool for writing parsers.

Also, you didn't ask, but in this case the transformation could be done in place (since the output is never longer than the input).

I'd suggest a PEG grammar like this (using Boost Spirit):

#include <boost/spirit/home/x3.hpp>

template <typename In, typename Out>
Out compress_whitespace(In f, In l, Out out) {
    auto copy = [&out](auto& ctx) {
        struct Append {
            static void call(Out& out, char ch) { *out++ = ch; }
            static void call(Out& out, boost::iterator_range<In> raw) {
                for (auto ch : raw) *out++ = ch; }
        };

        Append::call(out, _attr(ctx));
    };

    using namespace boost::spirit::x3;
    auto prefix  = raw[uint_ >> "   "][copy];
    auto comment = raw["C " >> *(char_ - eol)][copy];
    auto code_ch = omit[+blank] >> attr(' ')[copy] | (char_ - eol)[copy];
    auto line    = prefix >> (comment | *code_ch);
    auto newline = raw[eol][copy];

    parse(f, l, -line % newline);
    return out;
}

To disallow empty lines:

    parse(f, l, line % newline);

To throw on incomplete or invalid input, change the parse line:

    parse(f, l, expect[line % newline >> *newline >> eoi]);

Live Demo

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main(int argc, char** argv)
{
    std::ostreambuf_iterator out(std::cout);

    for (std::string file : std::vector(argv+1, argv+argc)) {
        std::ifstream s(file, std::ios::binary);
        std::string const program(std::istreambuf_iterator<char>{s}, {});

        compress_whitespace(begin(program), end(program), out);
    }
}

[Screenshot: output compared using vim -d input.txt <(./sotest input.txt)]

BONUS: In place processing

Since we know the output will be the same length or shorter, you can afford to process in place:

    std::string program = R"~(
01090   C   -------CALCULATION OF SOMETHING--
01100   "SOME.VARIABLE"   =  "SOME.OTHER.VARIABLE" + 2
01110   IF("SOME.VARIABLE" .NE.  "SOME.VALUE")  THEN   ON("SOME.MACHINE"))~";

    auto b = begin(program), e = end(program),
         new_e = compress_whitespace(b, e, b);

    std::cout << "Shorter by " << (e - new_e) << " chars\n";
    program.erase(new_e, e);

    std::cout << program << "\n";

See it Live On Coliru, printing:

Shorter by 7 chars

01090   C   -------CALCULATION OF SOMETHING--
01100   "SOME.VARIABLE" = "SOME.OTHER.VARIABLE" + 2
01110   IF("SOME.VARIABLE" .NE. "SOME.VALUE") THEN ON("SOME.MACHINE")


顾北清歌寒 2025-01-17 14:08:53

You might use

^\d{5}\t[cC] .*$(*SKIP)(*FAIL)|\h{2,}

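^ Start of the line (with multiline matching, as in Sublime Text's search)
\d{5}\t Match 5 digits and a tab
[cC] followed by a space: match either c or C, then a space
.*$ Match the rest of the line
(*SKIP)(*FAIL) Discard the text matched so far and resume searching after it
| Or
\h{2,} Match 2 or more horizontal whitespace chars, such as spaces and tabs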

In the replacement use a single space.

As solved by the OP, a shortened version using \K and matching 2 or more spaces:

^\d+\tC.*\K|[ ]{2,}

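^ Start of the line (with multiline matching, as in Sublime Text's search)
\d+ Match 1 or more digits
\tC Match a tab and an uppercase C (use [cC] to also cover a lowercase c, as in the question)
.*\K Match the rest of the line, then reset the start of the reported match so the text consumed so far is not part of it
| Or
[ ]{2,} Match 2 or more spaces (the square brackets are only for visibility and are not necessary)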
