Normalizing strings for text matching with preg_replace
I'm performing a pretty simple text match between a set of names from my MySQL db and a set of strings from a CSV file. Before the actual comparison, I run preg_replace with an array of patterns to normalize the strings. One of the important replacements is changing irregular abbreviations into regular full words. But I can't seem to capture abbreviations like "Inc." and "Inc", or "Corp." and "Corp", that may or may not have a trailing period.
Here is the code:
$patterns = array();
$patterns[0] = '/\s+/';
$patterns[1] = '/&/';
$patterns[2] = '/\bAssoc\.{0,1}\b/';
$patterns[3] = '/\bInc(?!\.)\b/';
$patterns[4] = '/\b(L\.?){2}P\.?/';
$patterns[5] = '/\bUniv(\s|\.)+\b/';
$patterns[6] = '/\bCorp\.?/';
$patterns[7] = '/\bAssn\.?/';
$patterns[8] = '/\bUnivesity\b/';
$patterns[9] = '/\bIntl.\b/';
$replacement = array();
$replacement[0] = ' ';
$replacement[1] = 'and';
$replacement[2] = 'Association';
$replacement[3] = 'Inc.';
$replacement[4] = '';
$replacement[5] = 'University';
$replacement[6] = 'Corporation';
$replacement[7] = 'Association';
$replacement[8] = 'University';
$replacement[9] = 'International';
$name = trim(preg_replace($patterns,$replacement,$name));
if(stristr($name,trim(preg_replace($patterns,$replacement,$org->org_name)))) return $org->org_id;
// code here
}
Here are some matches that aren't working (more to come):
Haystack => Needle
- "Aries International Inc." => "Aries Intl. Inc."
- "Phelps Dodge Corporation" => "Phelps Dodge Corp."
- "McDermott Incorporated" => "McDermott Inc."
As far as I can tell, it's not catching "Inc." and "Corp.", at least not consistently. Any help?
Comments (1)
Put the \b right after the abbreviation, followed by a dot which is optional (\.?), so that the word boundary is tested before the dot rather than after it.
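A minimal sketch of that advice, applied to the three sample pairs from the question. The helper name normalize_name, the trimmed-down pattern list, and the choice of 'Incorporated' as the replacement for "Inc" are assumptions made for illustration; the question's own code replaces "Inc" with 'Inc.'.

```php
<?php
// Sketch only: \b sits directly after the abbreviation, then an optional literal dot.
// Replacement words are assumptions chosen so the sample pairs normalize to the same text.
$patterns = array(
    '/\s+/',          // collapse runs of whitespace
    '/\bIntl\b\.?/',  // "Intl" or "Intl."
    '/\bInc\b\.?/',   // "Inc" or "Inc.", but not "Incorporated"
    '/\bCorp\b\.?/',  // "Corp" or "Corp.", but not "Corporation"
);
$replacement = array(
    ' ',
    'International',
    'Incorporated',
    'Corporation',
);

// Hypothetical helper wrapping the same trim(preg_replace(...)) call as the question.
function normalize_name($name, $patterns, $replacement) {
    return trim(preg_replace($patterns, $replacement, $name));
}

// Haystack => Needle pairs taken from the question.
$pairs = array(
    'Aries International Inc.' => 'Aries Intl. Inc.',
    'Phelps Dodge Corporation' => 'Phelps Dodge Corp.',
    'McDermott Incorporated'   => 'McDermott Inc.',
);

foreach ($pairs as $haystack => $needle) {
    $h = normalize_name($haystack, $patterns, $replacement);
    $n = normalize_name($needle, $patterns, $replacement);
    $found = stristr($h, $n) !== false;
    echo $haystack, ' / ', $needle, ': ', ($found ? 'match' : 'no match'), "\n";
}
```

Under these assumptions all three pairs report a match. For comparison, the question's /\bCorp\.?/ has no boundary after "Corp", so it also matches the start of "Corporation" and turns it into "Corporationoration". Its /\bIntl.\b/ requires a word boundary after the dot, which does not exist when the dot is followed by a space or the end of the string.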