Regex to match all instances not inside quotes

Posted 2024-11-17 02:07:32 · 562 words · 1 view · 0 comments

From this q/a, I deduced that matching all instances of a given regex that are not inside quotes is impossible. That is, it can't handle escaped quotes (ex: "this whole \"match\" should be taken"). If there is a way to do it that I don't know about, that would solve my problem.

If not, however, I'd like to know if there is any efficient alternative that could be used in JavaScript. I've thought about it a bit, but can't come up with any elegant solution that would work in most, if not all, cases.

Specifically, I just need the alternative to work with .split() and .replace() methods, but if it could be more generalized, that would be the best.

For Example:
An input string of:
+bar+baz"not+or\"+or+\"this+"foo+bar+
replacing + with #, not inside quotes, would return:
#bar#baz"not+or\"+or+\"this+"foo#bar#

Comments (4)

Actually, you can match all instances of a regex not inside quotes for any string in which each opening quote is closed again. Say, as in your example above, you want to match \+.

The key observation here is that a match is outside quotes if an even number of quotes follows it. This can be modeled as a look-ahead assertion:

\+(?=([^"]*"[^"]*")*[^"]*$)

Now, you'd like to not count escaped quotes. This gets a little more complicated. Instead of [^"]*, which advances to the next quote, you need to consider backslashes as well and use [^"\\]*. Once you arrive at either a backslash or a quote, you skip the next character if it was a backslash, or else advance to the next unescaped quote. That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at

\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)

I admit it is a little cryptic. =)
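
As a minimal sketch of how the look-ahead pattern above might be used from JavaScript: the variable names here are illustrative and not part of the answer. One adjustment is assumed: the groups are written as non-capturing (?:...), because String.prototype.split adds the text of capturing groups to its result array, which is usually not wanted.

```javascript
// The look-ahead pattern from the answer, with its groups made
// non-capturing so that split() does not splice captured text into the result.
const plusOutsideQuotes =
  /\+(?=(?:[^"\\]*(?:\\.|"(?:[^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g;

// Sample input from the question; "\\" in a JS string literal is one backslash.
const input = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';

// Replace every "+" that is followed by a balanced run of quoted strings.
const replaced = input.replace(plusOutsideQuotes, '#');

// The same pattern also works as a split delimiter.
const pieces = input.split(plusOutsideQuotes);

console.log(replaced);
console.log(pieces);
```

Because the look-ahead rescans the rest of the string for every candidate match, this approach can get slow on long inputs; the alternation-based answers below scan the string only once.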

回首观望 2024-11-24 02:07:32

Azmisov, resurrecting this question because you said you were looking for any efficient alternative that could be used in JavaScript and any elegant solutions that would work in most, if not all, cases.

There happens to be a simple, general solution that wasn't mentioned.

Compared with alternatives, the regex for this solution is amazingly simple:

"[^"]+"|(\+)

The idea is that we match but ignore anything within quotes to neutralize that content (on the left side of the alternation). On the right side, we capture all the + that were not neutralized into Group 1, and the replace function examines Group 1. Here is full working code:

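<script>
var subject = '+bar+baz"not+these+"foo+bar+';
var regex = /"[^"]+"|(\+)/g;
// Quoted strings match the left branch and are returned unchanged;
// a "+" outside quotes is captured in group 1 and replaced with "#".
var replaced = subject.replace(regex, function(m, group1) {
    if (!group1) return m;
    else return "#";
});
document.write(replaced);
</script>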

Online demo

You can use the same principle to match or split. See the question and article in the reference, which will also point you to code samples.
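
For splitting, one possible sketch of the same "match but ignore" idea is shown below. The function name splitOutsideQuotes is hypothetical, and the loop relies on String.prototype.matchAll (ES2020); it is one way to apply the principle, not the code from the referenced article.

```javascript
// Split on "+" characters that are outside double-quoted strings.
// Quoted strings match the left branch and are skipped; only a match
// with group 1 set is treated as a delimiter.
function splitOutsideQuotes(subject) {
  const regex = /"[^"]*"|(\+)/g;
  const parts = [];
  let last = 0; // start index of the piece currently being collected
  for (const m of subject.matchAll(regex)) {
    if (m[1] === undefined) continue; // a quoted string, not a delimiter
    parts.push(subject.slice(last, m.index));
    last = m.index + m[0].length;
  }
  parts.push(subject.slice(last)); // trailing piece, possibly empty
  return parts;
}

console.log(splitOutsideQuotes('+bar+baz"not+these+"foo+bar+'));
```

Like the built-in split, this keeps empty pieces at the start and end when the input begins or ends with a delimiter.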

Hope this gives you a different idea of a very general way to do this. :)

What about Empty Strings?

The above is a general answer to showcase the technique. It can be tweaked depending on your exact needs. If you worry that your text might contain empty strings, just change the quantifier inside the string-capture expression from + to *:

"[^"]*"|(\+)

See demo.
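
To see why the quantifier matters, here is a small sketch comparing the two patterns. The input 'a""+b"c+d"+e' is a made-up example (not from the question) that contains an empty string followed by a "+" outside quotes.

```javascript
// Hypothetical input: an empty string "" followed by "+" outside quotes.
const subject = 'a""+b"c+d"+e';

// Apply the "match but ignore" replacement with a given pattern.
const replaceWith = (regex) =>
  subject.replace(regex, (m, group1) => (group1 === undefined ? m : '#'));

// With "+", the empty string "" cannot match, so the engine pairs up
// the wrong quotes and the quoted/unquoted regions get out of sync.
const withPlus = replaceWith(/"[^"]+"|(\+)/g);

// With "*", the empty string is consumed as a (skipped) quoted string.
const withStar = replaceWith(/"[^"]*"|(\+)/g);

console.log(withPlus); // a""+b"c#d"#e  (the wrong "+" was replaced)
console.log(withStar); // a""#b"c+d"#e
```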

What about Escaped Quotes?

Again, the above is a general answer to showcase the technique. Not only can the "ignore this match" regex be refined to your needs, you can also add multiple expressions to ignore. For instance, if you want to make sure escaped quotes are adequately ignored, you can start by adding an alternation \\"| in front of the other two in order to match (and ignore) straggling escaped double quotes.

Next, within the section "[^"]*" that captures the content of double-quoted strings, you can add an alternation to ensure escaped double quotes are matched before their " has a chance to turn into a closing sentinel, turning it into "(?:\\"|[^"])*"

The resulting expression has three branches:

  1. \\" to match and ignore
  2. "(?:\\"|[^"])*" to match and ignore
  3. (\+) to match, capture and handle

Note that in other regex flavors, we could do this job more easily with lookbehind, but JS didn't support it when this answer was written (lookbehind assertions were only added in ES2018).

The full regex becomes:

\\"|"(?:\\"|[^"])*"|(\+)

See regex demo and full script.

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...
对你而言 2024-11-24 02:07:32

You can do it in three steps.

  1. Use a regex global replace to extract all string body contents into a side-table.
  2. Do your comma translation
  3. Use a regex global replace to swap the string bodies back

Code below

// Step 1
var sideTable = [];
myString = myString.replace(
    /"(?:[^"\\]|\\.)*"/g,
    function (_) {
      var index = sideTable.length;
      sideTable[index] = _;
      return '"' + index + '"';
    });
// Step 2, replace commas with newlines
myString = myString.replace(/,/g, "\n");
// Step 3, swap the string bodies back
myString = myString.replace(/"(\d+)"/g,
    function (_, index) {
      return sideTable[index];
    });

If you run that after setting

myString = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}';

you should get

{:a "ab,cd, efg"
 :b "ab,def, egf,"
 :c "Conjecture"}

It works because, after step 1,

myString = '{:a "0", :b "1", :c "2"}'
sideTable = ['"ab,cd, efg"', '"ab,def, egf,"', '"Conjecture"'];

so the only commas in myString are outside strings. (The side table stores each whole match, including its surrounding quotes.) Step 2 then turns commas into newlines:

myString = '{:a "0"\n :b "1"\n :c "2"}'

Finally, we replace the strings that contain only digits with their original content.
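
The three steps can also be wrapped into a reusable helper and applied to the "+" to "#" example from the question. This is a minimal sketch: the function name replaceOutsideQuotes and its parameters are illustrative, and it assumes the replacement text itself never produces a quoted run of digits such as "0".

```javascript
// Replace `pattern` with `replacement` only outside double-quoted strings,
// using the side-table approach described above.
function replaceOutsideQuotes(input, pattern, replacement) {
  const sideTable = [];

  // Step 1: move each quoted string (escapes allowed) into the side table
  // and leave a quoted index such as "0" in its place.
  const masked = input.replace(/"(?:[^"\\]|\\.)*"/g, (str) => {
    sideTable.push(str);
    return '"' + (sideTable.length - 1) + '"';
  });

  // Step 2: do the actual translation; no quoted content is left to hit.
  const translated = masked.replace(pattern, replacement);

  // Step 3: swap the original strings back in. A callback is used so that
  // "$" sequences inside the stored strings are not interpreted.
  return translated.replace(/"(\d+)"/g, (_, index) => sideTable[Number(index)]);
}

const question = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
console.log(replaceOutsideQuotes(question, /\+/g, '#'));
```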

小糖芽 2024-11-24 02:07:32

Although the answer by zx81 seems to be the cleanest and best-performing one, it needs these fixes to correctly handle the escaped quotes:

  1. var subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
  2. var regex = /"(?:[^"\\]|\\.)*"|(\+)/g;
  3. The already mentioned group1 check ("group1 === undefined" or "!group1").

Fix 2 in particular seems important in order to actually take everything asked in the original question into account.

It should be mentioned though that this method implicitly requires the string to not have escaped quotes outside of unescaped quote pairs.
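
Put together, the fixes above might look like the following sketch. It combines the corrected subject and regex from this answer with the replace callback from zx81's answer; the variable name result is illustrative.

```javascript
// Sample input from the question; "\\" in a JS string literal is one backslash.
const subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';

// Left branch: a quoted string in which any backslash escapes the next
// character. Right branch: a "+" outside quotes, captured in group 1.
const regex = /"(?:[^"\\]|\\.)*"|(\+)/g;

// Keep quoted strings unchanged; replace captured "+" characters.
const result = subject.replace(regex, (match, group1) =>
  group1 === undefined ? match : '#');

console.log(result); // #bar#baz"not+or\"+or+\"this+"foo#bar#
```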
