preg_match: matching a keyword variable against a list of Latin and non-Latin keywords in a local UTF-8 encoded file
I have a bad words filter that uses a list of keywords saved in a local UTF-8 encoded file. This file includes both Latin and non-Latin chars (mostly English and Arabic). Everything works as expected with Latin keywords, but when the variable includes non-Latin chars, the matching does not seem to recognize these existing keywords.
How do I go about matching both Latin and non-Latin keywords?
The badwords.txt file contains one word per line, as in this example:
bad
nasty
racist
سفالة
وساخة
جنس
Code used for matching:
$badwords = file_get_contents("badwords.txt");
$badtemp = explode("\n", $badwords);
$badwords = array_unique($badtemp);
$badFlag = 0; // originally initialized as $hasBadword, but $badFlag is the variable tested below
$query = strtolower($query);
foreach ($badwords as $key => $val) {
    if (!empty($val)) {
        $val = trim($val);
        // Case-insensitive match of the keyword between \b word boundaries
        $regexp = "/\b" . $val . "\b/i";
        if (preg_match($regexp, $query))
            $badFlag = 1;
        if ($badFlag == 1) {
            // Bad word detected die...
        }
    }
}
I've read that iconv, the multibyte functions (mbstring), and the /u modifier might help with this, and I tried a few things but can't seem to get it right. Any help in resolving this, so that it matches both Latin and non-Latin keywords, would be much appreciated.
Comments (2)
The problem seems to relate to recognizing word boundaries; the \b construct is apparently not “Unicode aware.” This is what the answers to question php regex word boundary matching in utf-8 seem to suggest. I was able to reproduce the problem even with text containing Latin letters like “é” when \b was used. And the problem seems to disappear (i.e., Arabic words get correctly recognized) when I set
and modify the regexp as follows:
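The snippets this answer refers to are not preserved in this copy. As a minimal sketch of the general idea only, and not necessarily what the answer used, one way to avoid relying on \b is to combine the u modifier with lookarounds built from Unicode character properties. The function name containsBadWord is hypothetical, and the choice of \p{L}, \p{M}, \p{N} and underscore as "word characters" is an assumption.

```php
<?php
// Hypothetical helper, not taken from the original answer.
// Returns true if any keyword occurs in $query as a whole word.
function containsBadWord(string $query, array $badwords): bool
{
    foreach ($badwords as $word) {
        $word = trim($word);
        if ($word === '') {
            continue; // skip blank lines from the word list
        }
        // The lookbehind and lookahead stand in for \b: the keyword must not be
        // preceded or followed by a letter, combining mark, digit, or underscore.
        // The u modifier makes PCRE treat pattern and subject as UTF-8.
        // preg_quote() escapes any regex metacharacters in the keyword.
        $regexp = '/(?<![\p{L}\p{M}\p{N}_])'
            . preg_quote($word, '/')
            . '(?![\p{L}\p{M}\p{N}_])/iu';
        if (preg_match($regexp, $query) === 1) {
            return true;
        }
    }
    return false;
}
```

With a list such as ['bad', 'جنس'], this sketch should flag 'This is BAD!' and a sentence containing جنس as a separate word, while leaving 'badminton' alone. Note that preg_match() returns false rather than 0 when the subject is not valid UTF-8, which is why the result is compared against 1.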
Some string functions in PHP are not safe to use on UTF-8 strings. Native Unicode support was planned for version 6, which was never released, so you need to be careful what you do with a string.
It looks like strtolower() is one of them; you need to use mb_strtolower($query, 'UTF-8') instead. If that doesn't fix it, you'll need to read through the code, find every point where you process $query or badwords.txt, and check the documentation for UTF-8 bugs.
As far as I know, preg_match() is OK with UTF-8 strings, but some features are disabled by default to improve performance. I don't think you need any of them.
Please also double-check that badwords.txt is a UTF-8 file and that $query contains a valid UTF-8 string (if it's coming from the browser, the page encoding is declared with a <meta> tag).
If you're trying to debug UTF-8 text, remember that most web browsers do not default to the UTF-8 text encoding, so any PHP variable you print for debugging will not be displayed correctly unless you select UTF-8 (in my browser, with View -> Encoding -> Unicode).
You shouldn't need iconv or any of the other conversion APIs; most of them will simply replace all of the non-Latin characters with Latin ones, which is obviously not what you want.
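The two checks suggested above (lowercasing with mb_strtolower() and confirming the input really is UTF-8) can be sketched as follows. This assumes the mbstring extension is loaded; the function name normalizeQuery and the choice to return null for invalid input are hypothetical.

```php
<?php
// Hypothetical helper; only mb_check_encoding() and mb_strtolower() are real PHP (mbstring) functions.
function normalizeQuery(string $query): ?string
{
    // Refuse input that is not valid UTF-8 instead of matching against broken bytes.
    if (!mb_check_encoding($query, 'UTF-8')) {
        return null;
    }
    // Lowercase with an explicit encoding so multibyte characters are handled.
    return mb_strtolower($query, 'UTF-8');
}
```

The same mb_check_encoding() call could be applied to the contents of badwords.txt after reading it, to confirm the file is actually saved as UTF-8.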