Regex using word boundaries to match alphanumeric and non-alphanumeric characters in JavaScript

Posted on 2024-10-03 00:19:13

I am trying to highlight a set of keywords using JavaScript and regex. I am facing one problem: my keywords may contain both literal and special characters, as in @text, #number, etc. I am using word boundaries to match and replace whole words only, not partial words (contained within another word).

var pattern = new RegExp('\\b(' + keyword + ')\\b', 'gi');

This expression matches whole keywords and highlights them. However, a keyword like "number:" does not get highlighted.

I am aware that \bword\b matches at word boundaries, and that special characters are non-alphanumeric, so a keyword that ends in one is not matched by the expression above.
Can you let me know what regular expression I can use to accomplish this?
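
A minimal sketch of why the trailing \b fails for such a keyword. The sample text is hypothetical and chosen only for illustration: after the colon comes a space, and \b needs a word character on exactly one side of the position.

```javascript
// Hypothetical sample text, not taken from the original question.
const sampleText = 'social security number: 123';

// ':' and ' ' are both non-word characters, so there is no \b between them.
const withTrailingBoundary = /\bnumber:\b/i;
// Dropping the trailing \b lets the keyword match.
const withoutTrailingBoundary = /\bnumber:/i;

const trailingMatches = withTrailingBoundary.test(sampleText);       // false
const leadingOnlyMatches = withoutTrailingBoundary.test(sampleText); // true
console.log(trailingMatches, leadingOnlyMatches);
```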

==Update==

For the above, I tried the regex that Tim Pietzcker suggested:

expr: (?:^|\\b|\\s)(" + keyword + ")(?:$|\\b|\\s)

This seems to match whole words containing both alphanumeric and non-alphanumeric characters. However, whenever a keyword has an HTML tag immediately before or after it, with no space in between, the keyword is not highlighted (e.g. social security *number:< br >*).
I tried the following regex, but it replaces the HTML tag preceding the keyword:

expr: (?:^|\b|\s|<[^>]+>)number:(?:$|\b|\s|<[^>]+>) 

Here, for the keyword number:, the < br > tag that follows it with no space in between (the space inside the tag is added intentionally to stop the browser from interpreting it) gets highlighted along with the keyword.

Can you suggest an expression that matches whole words containing both alphanumeric and non-alphanumeric characters while ignoring adjacent HTML tags?

Comments (6)

烟柳画桥 2024-10-10 00:19:13

2021 update: JS now supports lookbehind so this answer is a little outdated.

OK, so you have two problems: JavaScript doesn't support lookbehind, and \b only finds boundaries between alphanumeric and non-alphanumeric characters.

The first question: What exactly does constitute a word boundary for your keywords? My guess is that it must be either a \b boundary or whitespace. If that is the case, you could search for

"(?:^|\\b|\\s)(" + keyword + ")(?:$|\\b|\\s)"

Of course, the whitespace characters around keywords like @number# would also become part of the match, but perhaps highlighting those isn't such a problem. In other cases, i.e. if there is an actual word boundary that can match, the spaces won't be part of the match, so it should work fine in the majority of cases.

The actual word you're interested in will be in backreference #1, so if you can highlight that separately, even better.
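
As a rough sketch of highlighting only the captured word, one option is to capture the leading delimiter as well and put it back unchanged. This is a variation on the pattern above, not the answer's exact expression: the leading group is made capturing, and the trailing group becomes a lookahead so that it does not consume a space needed by the next keyword. The helper names escapeRegExp and highlight and the <mark> wrapper are assumptions for illustration.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal keyword.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap each whole-keyword occurrence in <mark>.
function highlight(text, keyword) {
  const re = new RegExp(
    '(^|\\b|\\s)(' + escapeRegExp(keyword) + ')(?=$|\\b|\\s)',
    'gi'
  );
  // $1 is the leading delimiter (possibly empty), $2 is the keyword itself.
  return text.replace(re, '$1<mark>$2</mark>');
}

console.log(highlight('ping @number# now', '@number#'));
```

Because the surrounding whitespace is re-emitted outside the wrapper, only the keyword itself ends up highlighted.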

EDIT:
If characters other than spaces may occur before or after a keyword, then I think the only thing you can do (if you're stuck with JavaScript) is:

  1. Check if your keyword starts with an alnum character.
  2. If so, prepend \b to your regex.
  3. Check if your keyword ends with an alnum character.
  4. If so, append \b to your regex.

So, for keyword, use \bkeyword\b; for number:, use \bnumber:; for @twitter, use @twitter\b.
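
The four steps above can be sketched as follows. The function names are hypothetical, and \w is used as the "alnum" test because \b itself is defined in terms of \w (which also includes the underscore).

```javascript
// Hypothetical helper: escape regex metacharacters in a literal keyword.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: add \b only on the sides where the keyword
// starts or ends with a word character.
function keywordRegExp(keyword) {
  const prefix = /^\w/.test(keyword) ? '\\b' : '';
  const suffix = /\w$/.test(keyword) ? '\\b' : '';
  return new RegExp(prefix + '(' + escapeRegExp(keyword) + ')' + suffix, 'gi');
}

console.log(keywordRegExp('keyword').source);
console.log(keywordRegExp('number:').source);
console.log(keywordRegExp('@twitter').source);
```

Note that a side without \b is left unconstrained, so @twitter would also match inside a longer token such as my@twitter; whether that matters depends on the text being highlighted.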

苏大泽ㄣ 2024-10-10 00:19:13

We need to look for a substring that has a whitespace character on both sides. If JavaScript supported lookbehind, this would look like:

var re = new RegExp('(?<!\\S)' + keyword + '(?!\\S)', 'gi');

That didn't work in JavaScript when this was written (lookbehind only arrived with ES2018), although it does work in Perl and other scripting languages. Instead, we need to include the leading whitespace character (or beginning of string) as the beginning part of the match (and optionally capture what we are really looking for into $1):

var re = new RegExp('(?:^|\\s)(' + keyword + ')(?!\\S)', 'gi');

Just bear in mind that, unless the match is at the very beginning of the string, the real place where a match starts will be one character after the .index property of the result returned by re.exec(string), and that if you are accessing the matched string, you either need to remove the first character with .slice(1) or simply access what is captured.
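
A sketch of both approaches side by side, assuming an engine with ES2018 lookbehind for the second one. The function names are hypothetical. In the first version the leading group is made capturing (a small change from the pattern above) so that its length can be added to .index, which also handles a match at the start of the string.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal keyword.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Without lookbehind: consume the leading delimiter, then correct the index.
function findWithCapture(text, keyword) {
  const re = new RegExp('(^|\\s)(' + escapeRegExp(keyword) + ')(?!\\S)', 'gi');
  const hits = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    // m[1] is '' at the start of the string, otherwise one whitespace character.
    hits.push({ index: m.index + m[1].length, text: m[2] });
  }
  return hits;
}

// With lookbehind: the match itself is exactly the keyword.
function findWithLookbehind(text, keyword) {
  const re = new RegExp('(?<!\\S)' + escapeRegExp(keyword) + '(?!\\S)', 'gi');
  return Array.from(text.matchAll(re), (m) => ({ index: m.index, text: m[0] }));
}

console.log(findWithCapture('#number and #number', '#number'));
console.log(findWithLookbehind('#number and #number', '#number'));
```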

奶茶白久 2024-10-10 00:19:13

Maybe what you're trying to do is:

'\\b\\W*(' + keyword + ')\\W*\\b'

别在捏我脸啦 2024-10-10 00:19:13

Lookahead and lookbehind are your answer: "(?<=^|\\s)" + keyword + "(?=\\s|$)". The bits in parentheses aren't included in the match, so put whatever characters aren't permitted in the keywords in there.

我一直都在从未离去 2024-10-10 00:19:13

As Tim correctly points out, \b is a tricky thing that works differently from the way people often think it works. Read this answer for more details about this matter, and what you can do about it.

In brief, this is a boundary to the left:

(?(?=\w)(?<!\w)|(?<!\W))

and this is a boundary to the right:

(?(?<=\w)(?!\w)|(?!\W))

People always think there are spaces involved, but there aren't. However, now that you know the real definitions, it's easy to build that into them. One could swap out \w and \W in exchange for \s and \S in the two patterns above. Or one could add whitespace awareness to the else blocks.
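
The two patterns above use conditional groups, which JavaScript regular expressions do not support. A possible translation, offered as an assumption rather than as part of the original answer, rewrites each conditional as an alternation and relies on ES2018 lookbehind. The constant and function names are hypothetical.

```javascript
// "If the next char is a word char, the previous one must not be;
//  otherwise the previous one must not be a non-word char."
const LEFT_BOUNDARY = '(?:(?=\\w)(?<!\\w)|(?!\\w)(?<!\\W))';
// "If the previous char is a word char, the next one must not be;
//  otherwise the next one must not be a non-word char."
const RIGHT_BOUNDARY = '(?:(?<=\\w)(?!\\w)|(?<!\\w)(?!\\W))';

// Hypothetical helper; expects a keyword that is already regex-escaped.
function boundedRegExp(escapedKeyword) {
  return new RegExp(LEFT_BOUNDARY + escapedKeyword + RIGHT_BOUNDARY, 'i');
}

console.log(boundedRegExp('number:').test('number:5'));  // true
console.log(boundedRegExp('number:').test('number: 5')); // false
```

The second result shows the point about spaces: a space after the colon is not a boundary under these definitions, which is why swapping in \s and \S (or otherwise adding whitespace awareness) is suggested.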

朕就是辣么酷 2024-10-10 00:19:13

Try this; it should work:

var pattern = new RegExp("\\b" + keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&") + "\\b", "gi");