PHP Regex: Searching articles for English and Arabic text
I'm searching articles for keywords which are in both English and Arabic.
The articles can be either in English or Arabic.
My current code is:
$k = implode("|", $keywords);
$regexp = "/(?i)\b(".$k.")\b/";
preg_match_all( $regexp, $content, $matches );
But this doesn't find keywords in Arabic articles for some reason. I've verified that both the keywords and articles are being read correctly; no encoding issues.
What can I do to fix this? Note that there is no way for me to detect whether an article or keyword is in English or Arabic, so there has to be a single regex to match them all.
Comments (1)
Your regex might simply lack the /u (Unicode) flag. Without it, PCRE has to compare bytes. In that case it might still be able to find the words (when the UTF-8 encoding is identical), but it won't ever detect the word boundaries (\b).

Update

Okay, \b really only detects \w boundaries (so it depends on the locale setting rather than on the /u flag). Then try this instead, which uses assertions:
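The answer's own snippet is not preserved here, so what follows is only a minimal sketch of the assertion idea, not the answerer's original code. It assumes a "word" is a run of Unicode letters, digits, or underscores, and it replaces each \b with a lookaround that checks for the absence of such a character. The sample $keywords and $content values are hypothetical, and the preg_quote() step is an addition that the question's code does not have.

```php
<?php
// Hypothetical sample data standing in for the asker's $keywords and $content.
$keywords = ['news', 'كتاب'];
$content  = 'Breaking NEWS: هذا كتاب جديد';

// Escape each keyword so any regex metacharacters in it are matched literally.
$quoted = array_map(function ($kw) {
    return preg_quote($kw, '/');
}, $keywords);
$k = implode('|', $quoted);

// (?<![\p{L}\p{N}_]) : the match is not preceded by a letter, digit, or underscore
// (?![\p{L}\p{N}_])  : the match is not followed by one
// i = case-insensitive, u = treat pattern and subject as UTF-8
$regexp = '/(?<![\p{L}\p{N}_])(' . $k . ')(?![\p{L}\p{N}_])/iu';

preg_match_all($regexp, $content, $matches);
print_r($matches[1]);
```

Because the boundaries are spelled out with \p{L} and \p{N}, the pattern does not rely on how \b or \w behave in a particular PHP/PCRE build or locale. With the sample data, both "NEWS" and "كتاب" should be captured, while a longer word such as "newsletter" should not match the keyword "news".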