preg_match: matching a keyword variable against a list of Latin and non-Latin keywords in a local UTF-8 encoded file

Posted on 2024-12-22 18:01:36


I have a bad words filter that uses a list of keywords saved in a local UTF-8 encoded file. This file includes both Latin and non-Latin chars (mostly English and Arabic). Everything works as expected with Latin keywords, but when the variable includes non-Latin chars, the matching does not seem to recognize these existing keywords.

How do I go about matching both Latin and non-Latin keywords?

The badwords.txt file contains one word per line, as in this example:

bad
nasty
racist
سفالة
وساخة
جنس

Code used for matching:



$badwords = file_get_contents("badwords.txt");
$badtemp = explode("\n", $badwords);
$badwords = array_unique($badtemp);
$hasBadword = 0;
$query = strtolower($query);

foreach ($badwords as $key => $val) {
    if (!empty($val)) {
        $val = trim($val);
        $regexp = "/\b" . $val . "\b/i";
        if (preg_match($regexp, $query))
            $badFlag = 1;

        if ($badFlag == 1) {
           // Bad word detected die...
        }
    }
}


I've read that iconv, multibyte functions (mbstring) and using the operator /u might help with this, and I tried a few things but do not seem to get it right. Any help would be much appreciated in resolving this, and having it match both Latin and non-Latin keywords.


风和你 2024-12-29 18:01:36


The problem seems to relate to recognizing word boundaries; the \b construct is apparently not "Unicode aware" here (the pattern in the question has no /u modifier, so the subject is treated as a sequence of bytes). This is what the answers to the question "php regex word boundary matching in utf-8" seem to suggest. I was able to reproduce the problem even with text containing Latin letters like "é" when \b was used. The problem seems to disappear (i.e., Arabic words are recognized correctly) when I set

$wstart = '(^|[^\p{L}])';
$wend = '([^\p{L}]|$)';

and modify the regexp as follows:

$regexp = "/" . $wstart . $val . $wend . "/iu";
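
A minimal sketch of how the fragments above might be combined into a single check. The function name containsBadWord and the sample inputs are hypothetical, and preg_quote() is an addition that is not in the answer; it is included so that a keyword containing regex metacharacters or the "/" delimiter cannot break the pattern. The sketch assumes the script file and all inputs are UTF-8 encoded.

```php
<?php
// Hypothetical helper: returns true if $query contains any entry of
// $badwords as a whole word, using letter-based boundaries with /iu.
function containsBadWord(string $query, array $badwords): bool
{
    // Same idea as $wstart/$wend above; (?:...) only avoids unused captures.
    $wstart = '(?:^|[^\p{L}])';
    $wend   = '(?:[^\p{L}]|$)';

    foreach ($badwords as $val) {
        $val = trim($val);
        if ($val === '') {
            continue; // skip blank lines from the word list
        }
        // Escape regex metacharacters and the "/" delimiter in the keyword.
        $regexp = '/' . $wstart . preg_quote($val, '/') . $wend . '/iu';
        if (preg_match($regexp, $query) === 1) {
            return true;
        }
    }
    return false;
}

$badwords = ['bad', 'nasty', 'جنس'];
var_dump(containsBadWord('هذا جنس آخر', $badwords)); // bool(true)
var_dump(containsBadWord('badminton', $badwords));   // bool(false)
```

Because the boundaries are defined by \p{L}, "bad" inside "badminton" is not reported, while a keyword surrounded by spaces or punctuation is.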


诗化ㄋ丶相逢 2024-12-29 18:01:36

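Some string functions in PHP cannot be used safely on UTF-8 strings. A fix was reportedly planned for version 6, but for now you need to handle strings carefully.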

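strtolower() looks like one of them; you need mb_strtolower($query, 'UTF-8') instead. If that does not fix the problem, read through the code, find every point where $query or badwords.txt is processed, and check the documentation of each function used there for UTF-8 bugs.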
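
As a small illustration of the mb_strtolower() suggestion, the sketch below lowercases a string that contains non-ASCII cased letters. The helper name normalizeQuery and the sample string are hypothetical; the mbstring extension is assumed to be loaded and the script file saved as UTF-8. Arabic has no letter case, so accented Latin letters are used to show the effect.

```php
<?php
// Hypothetical helper: trim the input and lowercase it in a UTF-8 aware way.
function normalizeQuery(string $query): string
{
    // mb_strtolower() works on characters, not bytes, when told the encoding.
    return mb_strtolower(trim($query), 'UTF-8');
}

echo normalizeQuery('  ÉCOLE Ünïcode '), "\n"; // école ünïcode
```

How plain strtolower() treats bytes outside ASCII has depended on the PHP version and locale, which is why it is avoided here.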

As far as I know, preg_match() is fine with UTF-8 strings, but some features are disabled by default for performance. I don't think you need any of them.

Also double-check that badwords.txt really is a UTF-8 file and that $query contains a valid UTF-8 string (if it comes from a browser, the encoding is set with a tag in the page).
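
One way these checks might look in code is sketched below. The function names isValidUtf8 and loadWordList are hypothetical. Stripping a leading byte order mark and handling Windows line endings are assumptions about what could be wrong with a word list file; neither is mentioned in the question. The mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: true if $s is a well-formed UTF-8 byte sequence.
function isValidUtf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

// Hypothetical helper: turn the raw file contents into a clean word list.
function loadWordList(string $contents): array
{
    // Drop a UTF-8 byte order mark, which would otherwise stick to the first word.
    if (strncmp($contents, "\xEF\xBB\xBF", 3) === 0) {
        $contents = substr($contents, 3);
    }
    // Split on \r\n, \r or \n so Windows line endings leave no stray "\r".
    $lines = preg_split('/\r\n|\r|\n/', $contents);
    // Trim each line and drop the empty ones.
    $words = array_filter(array_map('trim', $lines), fn($w) => $w !== '');
    // Remove duplicates and reindex from 0.
    return array_values(array_unique($words));
}

$contents = "\xEF\xBB\xBFbad\r\n\r\nجنس\nbad\n";
var_dump(isValidUtf8($contents)); // bool(true)
print_r(loadWordList($contents)); // the two entries: bad, جنس
```

In the question's code the contents would come from file_get_contents("badwords.txt"), and $query could be rejected early when isValidUtf8($query) is false.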

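If you are debugging UTF-8 text, remember that most web browsers do not default to UTF-8, so any PHP variable you print for debugging will not be displayed correctly unless you select UTF-8 in the browser (in mine, View -> Encoding -> Unicode).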
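
A debugging aid that does not depend on how the browser decodes the page is to print the raw bytes of a string. The helper name debugBytes is hypothetical; it only uses bin2hex() and str_split(). Declaring the charset in the response header, as shown in the comment, is another option when the output is sent to a browser.

```php
<?php
// Optionally, before any output is sent to a browser:
// header('Content-Type: text/html; charset=UTF-8');

// Hypothetical helper: show a string as space-separated hex bytes.
function debugBytes(string $s): string
{
    if ($s === '') {
        return '';
    }
    return implode(' ', str_split(bin2hex($s), 2));
}

echo debugBytes("\xC3\xA9"), "\n"; // c3 a9  (UTF-8 for "é")
echo debugBytes('bad'), "\n";      // 62 61 64
```

If a keyword read from badwords.txt and the same word in $query produce different byte sequences, the two are not encoded the same way.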

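You do not need iconv or any other conversion API; most of them would simply replace all non-Latin characters with Latin ones, which is clearly not what you want.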
