Mask all bad words except for the first letter

Posted on 2024-10-17 16:05:52

I'm attempting to create a bad word filter in PHP that will search a text, match against an array of known bad words, then replace each character (except the first letter) in the bad word with an asterisk.

Example:

  • fook would become f***
  • shoot would become s****

The part I can't work out is how to keep the first letter of the word while replacing the remaining letters with another character, so that the word keeps its original length.

My current code is unsuitable because it always replaces the whole word with exactly three asterisks:

$string = preg_replace("/\b(". $word .")\b/i", "***", $string);

Comments (5)

吃颗糖壮壮胆 2024-10-24 16:05:52
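$string = 'fook would become';
$word = 'fook';

// Keep the first letter and pad with one asterisk per remaining byte.
// Note: the match is case-insensitive, but the kept letter comes from $word,
// so "Fook" in the text would come back as "f***".
$mask = $word[0] . str_repeat('*', strlen($word) - 1);
$string = preg_replace("~\b" . preg_quote($word, '~') . "\b~i", $mask, $string);

var_dump($string); // string(17) "f*** would become"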
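
The question mentions an array of bad words, while the snippet above handles a single word. A minimal sketch of applying the same replacement to each entry in turn; the `$badWords` list and the sample text are assumptions made up for illustration:

```php
<?php
// Hypothetical blacklist and input text, chosen only for illustration.
$badWords = ['fook', 'shoot'];
$string   = 'Oh fook, do not shoot while shooting hoops.';

foreach ($badWords as $word) {
    // Keep the first letter and pad with one asterisk per remaining byte.
    $mask = $word[0] . str_repeat('*', strlen($word) - 1);

    // \b on both sides restricts the match to whole words.
    $string = preg_replace('~\b' . preg_quote($word, '~') . '\b~i', $mask, $string);
}

echo $string, PHP_EOL;
// Oh f***, do not s**** while shooting hoops.
```

Running one preg_replace() per word is simple but rescans the text once for every blacklist entry, which may matter for long lists.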
秋意浓 2024-10-24 16:05:52

This can be done in many ways, including with some very odd auto-generated regular expressions,
but I believe preg_replace_callback() ends up being more robust:

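<?php
// $words (the blacklist array) and $string (the text to filter) are assumed to be defined.

// As already pointed out, the words *may* need escaping before they go into a pattern.
foreach ($words as $k => $v) {
    $words[$k] = preg_quote($v, '/');
}

// Collapse the list into a single alternation: word1|word2|...
$words = implode('|', $words);

// After that, a single preg_replace_callback() call does the job.
$string = preg_replace_callback('/\b(' . $words . ')\b/i', 'my_beloved_callback', $string);

// $m[1] holds the bad word exactly as it appeared in the text.
function my_beloved_callback($m)
{
    $len = strlen($m[1]) - 1;

    // First letter as matched, followed by one asterisk per remaining byte.
    return $m[1][0] . str_repeat('*', $len);
}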
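
To see the callback approach end to end, here is a minimal sketch with sample data. The helper name `mask_bad_words`, the blacklist, and the input text are assumptions for illustration, and the arrow functions assume PHP 7.4 or later. Because the replacement is built from the matched text rather than from the blacklist entry, the original capitalization of the first letter survives.

```php
<?php
// Hypothetical helper: masks every blacklisted word except its first letter.
function mask_bad_words(string $text, array $words): string
{
    // Guard: an empty alternation would match the empty string at word
    // boundaries, and the callback would then index into an empty match.
    if ($words === []) {
        return $text;
    }

    // Escape each word so regex metacharacters in it are matched literally.
    $alternation = implode('|', array_map(fn($w) => preg_quote($w, '/'), $words));

    return preg_replace_callback(
        '/\b(' . $alternation . ')\b/i',
        // $m[1] is the word as it appears in the text, so its casing is kept.
        fn(array $m) => $m[1][0] . str_repeat('*', strlen($m[1]) - 1),
        $text
    );
}

echo mask_bad_words('Fook! I told you not to SHOOT.', ['fook', 'shoot']), PHP_EOL;
// F***! I told you not to S****.
```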
囚我心虐我身 2024-10-24 16:05:52
// Capture the first letter as $1 and mask the rest, one asterisk per remaining letter.
$string = preg_replace('/\b(' . preg_quote($word[0], '/') . ')' . preg_quote(substr($word, 1), '/') . '\b/i', '$1' . str_repeat('*', strlen($word) - 1), $string);
一直在等你来 2024-10-24 16:05:52

Assuming your blacklist of bad words consists entirely of letters, or at least of word characters (letters, digits, and underscores), you won't need to call preg_quote() before imploding the list and inserting it into the regex pattern.

Use the \G anchor to continue matching from where the previous match ended, once the first letter of a qualifying word has been matched. Every subsequently matched letter in the bad word is replaced one-for-one with an asterisk.

\K is used to forget/release the first letter of the bad word, so that letter is excluded from the replacement.

This approach removes the need to call preg_replace_callback() to measure every matched string and write N asterisks after the first letter of every matched bad word in a block of text.

Breakdown:

/                      #start of pattern delimiter
(?:                    #non-capturing group to encapsulate logic
   \b                  #position separating word character and non-word character
   (?=                 #start lookahead -- to match without consuming letters
      (?:fook|shoot)   #OR-delimited bad words
      \b               #position separating word character and non-word character
   )                   #end lookahead
   \w                  #first word character of bad word
   \K                  #forget first matched word character
   |                   #OR -- to set up \G technique
   \G(?!^)             #continue matching from previous match but not from the start of the string
)                      #end of non-capturing group
\w                     #match non-first letter of bad word
/                      #ending pattern delimiter
i                      #make pattern case-insensitive

Code:

$bad = ['fook', 'shoot'];
$pattern = '/(?:\b(?=(?:' . implode('|', $bad) . ')\b)\w\K|\G(?!^))\w/i';

echo preg_replace($pattern, '*', 'Holy fook n shoot, Batman; The Joker\'s shooting The Riddler!');
// Holy f*** n s****, Batman; The Joker's shooting The Riddler!
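
One caveat when the pattern is built dynamically: with an empty blacklist the lookahead reduces to `(?=\b)`, which holds at the start of every word, so every word would be masked. A minimal sketch of a wrapper that guards against this; the function name `mask_with_continue` and the sample inputs are hypothetical, and the blacklist is still assumed to contain word characters only:

```php
<?php
// Hypothetical wrapper around the \G-based pattern from the answer above.
function mask_with_continue(string $text, array $bad): string
{
    // Guard: an empty list would leave (?=\b) as the lookahead,
    // which is true at the start of every word.
    if ($bad === []) {
        return $text;
    }

    // Assumes every entry in $bad consists of word characters only,
    // so no preg_quote() call is made here.
    $pattern = '/(?:\b(?=(?:' . implode('|', $bad) . ')\b)\w\K|\G(?!^))\w/i';

    return preg_replace($pattern, '*', $text);
}

echo mask_with_continue('Fook that, shooter', ['fook', 'shoot']), PHP_EOL;
// F*** that, shooter
echo mask_with_continue('nothing to hide', []), PHP_EOL;
// nothing to hide
```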
冷血 2024-10-24 16:05:52

Here is a Unicode-friendly regular expression for PHP.
It should give you an idea of how to approach this.

function do_something_except_first_letter($s) {
    // \B\w matches a word character that is NOT at the start of a word, so the
    // first character of every word is skipped and the rest go to the callback.
    // This keeps the first letter even for words in quotes and brackets.
    // An alternative regex is '/(?<!^|\s|\W)(\w)/u'.
    return preg_replace_callback('/(\B\w)/u', function ($m) {
        // Do what you need here...
        // For example, lowercase every character except the first letter.
        return mb_strtolower($m[1]);
    }, $s);
}
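
Applied to the original question, the `\B\w` idea can mask a matched bad word without measuring its length, and with the `u` modifier it works on characters rather than bytes. A minimal sketch, assuming UTF-8 input and PHP 7.4 or later; the function name `mask_unicode` and the sample words are made up for illustration:

```php
<?php
// Hypothetical helper combining a bad-word match with the \B\w technique.
function mask_unicode(string $text, array $words): string
{
    // Guard: an empty alternation would match at every word boundary.
    if ($words === []) {
        return $text;
    }

    // Escape each word so regex metacharacters in it are matched literally.
    $alternation = implode('|', array_map(fn($w) => preg_quote($w, '/'), $words));

    return preg_replace_callback(
        '/\b(?:' . $alternation . ')\b/iu',
        // Inside the matched word, replace every character that is not
        // at a word boundary, i.e. everything after the first one.
        fn(array $m) => preg_replace('/\B\w/u', '*', $m[0]),
        $text
    );
}

echo mask_unicode('That fook ate my crème', ['fook', 'crème']), PHP_EOL;
// That f*** ate my c****
```

A byte-based strlen() would count the accented letter as two bytes and write one asterisk too many, which is the case this variant is meant to avoid.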