Normalizing strings for text matching with preg_replace

Posted on 2024-10-31 01:06:24

I'm performing a pretty simple text matching between a set of names from my MySQL db and a set of strings from a CSV file. Before the actual comparison, I run preg_replace with an array of options to normalize the strings. One of the important replacements is changing irregular abbreviations into regular full words. But I can't seem to capture abbreviations like "Inc." and "Inc", "Corp." and "Corp" that may or may not have a trailing period.

Here is the code:

$patterns = array();
$patterns[0] = '/\s+/';
$patterns[1] = '/&/';
$patterns[2] = '/\bAssoc\.{0,1}\b/';
$patterns[3] = '/\bInc(?!\.)\b/';
$patterns[4] = '/\b(L\.?){2}P\.?/';
$patterns[5] = '/\bUniv(\s|\.)+\b/';
$patterns[6] = '/\bCorp\.?/';
$patterns[7] = '/\bAssn\.?/';
$patterns[8] = '/\bUnivesity\b/';
$patterns[9] = '/\bIntl.\b/';

$replacement = array();
$replacement[0] = ' ';
$replacement[1] = 'and';
$replacement[2] = 'Association';
$replacement[3] = 'Inc.';
$replacement[4] = '';
$replacement[5] = 'University';
$replacement[6] = 'Corporation';
$replacement[7] = 'Association';
$replacement[8] = 'University';
$replacement[9] = 'International';

$name = trim(preg_replace($patterns,$replacement,$name));
if(stristr($name,trim(preg_replace($patterns,$replacement,$org->org_name)))) return $org->org_id;
// code here
}

Here are some matches that aren't working (more to come):

Haystack => Needle

  • "Aries International Inc." => "Aries Intl. Inc."
  • "Phelps Dodge Corporation" => "Phelps Dodge Corp."
  • "McDermott Incorporated" => "McDermott Inc."

As far as I can tell, it's not catching "Inc." and "Corp.", at least not consistently. Any help?

Comments (1)

请爱~陌生人 2024-11-07 01:06:24

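Put the \b directly after the abbreviation and follow it with an optional dot. With the \b after the dot, the dot can only be consumed when a word character follows it, which is not the case before a space or at the end of the string. Like so: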

$patterns[2] = '/\bAssoc\b\.?/';
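
The same change can be applied to the other abbreviation patterns from the question. Below is a minimal sketch rather than a drop-in fix: the function name normalize_name() and the rule table are made up for illustration, and mapping "Inc" and "Inc." to "Incorporated" is an assumption added so that the third example pair can compare equal (the original code keeps "Inc." instead).

```php
<?php
// Illustrative sketch: normalize_name() is a hypothetical helper,
// not code from the question or the answer.
function normalize_name(string $name): string
{
    // Each abbreviation rule puts \b right after the letters and then an
    // optional dot, so "Corp" and "Corp." match but "Corporation" does not.
    $rules = [
        '/\s+/'          => ' ',
        '/&/'            => 'and',
        '/\bAssoc\b\.?/' => 'Association',
        '/\bAssn\b\.?/'  => 'Association',
        '/\bInc\b\.?/'   => 'Incorporated', // assumption: expand instead of keeping "Inc."
        '/\bCorp\b\.?/'  => 'Corporation',
        '/\bIntl\b\.?/'  => 'International',
        '/\bUniv\b\.?/'  => 'University',
    ];

    // preg_replace applies the patterns in array order, each on the
    // result of the previous replacement.
    return trim(preg_replace(array_keys($rules), array_values($rules), $name));
}

// The three haystack/needle pairs from the question.
$pairs = [
    ['Aries International Inc.', 'Aries Intl. Inc.'],
    ['Phelps Dodge Corporation', 'Phelps Dodge Corp.'],
    ['McDermott Incorporated',   'McDermott Inc.'],
];

foreach ($pairs as [$haystack, $needle]) {
    // Case-insensitive substring test on the normalized forms.
    $hit = stripos(normalize_name($haystack), normalize_name($needle)) !== false;
    echo $haystack, ' / ', $needle, ': ', $hit ? 'match' : 'no match', PHP_EOL;
}
```

Because the \b sits between the abbreviation and the optional dot, full words such as "Corporation" are left alone, whereas the question's /\bCorp\.?/ also matches the first four letters of "Corporation" and turns it into "Corporationoration".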