Named entity recognition with a preset list of names in Python / PHP
I'm trying to process a CSV file in which each row has a text field containing the name of an organization and the position of an individual within that organization, as unstructured text. This field is usually a mess of text like this:
Assoc. Research Professor Dept. Psychology Univ. California Santa Barbara
I need to pull out the position and the organization name. For the position, I use preg_match for a series of about 60 different regular expressions for the different professions, and I think it works pretty well (my guess is that it catches about 80%). But, I'm having trouble catching the organization name. I have a MySQL table with roughly 16,000 organization names that I can perform a simple preg_match for, but due to common misspellings and abbreviations, it's only catching about 30% of the organizations. For example, my database has
University of California Santa Barbara
But the CSV file might have any of the options:
Univ Cal Santa Barbara
University Cal-Santa Barbara
University California-Santa Barbara
Cal University, Santa Barbara
I need to process several hundred thousand records, and I can't spend the time to correct 70% of the records that are currently not being processed correctly or painstakingly create multiple aliases for each organization. What I would like to be able to do is to catch small differences (such as the small misspellings, hyphens versus spaces, and common abbreviations), and, if still no matches are found, to ideally recognize an organizational name and create a new record for it.
- What libraries or tools in Python or PHP would let me perform a similarity match with a broader reach?
- Would NLTK in Python catch misspellings?
- Is it possible to use AlchemyAPI to catch misspelled organizations? So far I've only been able to use it to catch correctly spelled organizations.
- Since I'm comparing a short string (the organization name) to a longer string (that includes the name plus extraneous information) is there any hope in using PHP's similar_text function?
Any help or insight would be appreciated.
Comments (2)
This is within the domain of fuzzy logic. See if these are of any help:
http://www.phpclasses.org/blog/post/119-Neural-Networks-in-PHP.html
http://ann.thwien.de/index.php/Installation
You may be able to use difflib to calculate the similarity ratio between the CSV input and the canonical spelling, and consider it a match if it's above a certain threshold (say, 0.65). For example:
gives:
Note how 'Canterbury University' has a much lower match ratio() than the inputs you gave.
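A minimal sketch of that comparison, using only the standard library. The candidate list reuses the spellings from the question; the helper name similarity and the lower-casing step are my own assumptions, and the 0.65 threshold is the rough value suggested above, not a tuned one.

```python
from difflib import SequenceMatcher

CANONICAL = "University of California Santa Barbara"
THRESHOLD = 0.65  # rough cut-off from the answer; tune it on real data

def similarity(a, b):
    """Return difflib's similarity ratio (0.0 to 1.0), ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

candidates = [
    "Univ Cal Santa Barbara",
    "University Cal-Santa Barbara",
    "University California-Santa Barbara",
    "Cal University, Santa Barbara",
    "Canterbury University",  # a different organization, used as a control
]

# Score every candidate against the canonical spelling.
scores = {name: similarity(name, CANONICAL) for name in candidates}

for name, score in scores.items():
    verdict = "match" if score >= THRESHOLD else "no match"
    print(f"{score:.2f}  {verdict:8}  {name}")
```

With this threshold the four variants of the Santa Barbara name are accepted and the control is rejected, but a threshold that works for one organization may let through false positives for another, so it is worth checking against a hand-labelled sample.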
Then again, SequenceMatcher.ratio() may be too slow computed over 16,000 values.
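One way to reduce that cost while staying in the standard library is difflib.get_close_matches, which checks the cheaper upper bounds real_quick_ratio() and quick_ratio() before it falls back to the full ratio(). The sketch below also tries to deal with the field being longer than the organization name by testing each trailing run of words. The functions normalize and find_organization are hypothetical helpers, and the sketch assumes the organization sits at the end of the field, as in the sample row; that will not hold for every record.

```python
from difflib import SequenceMatcher, get_close_matches

def normalize(text):
    """Lower-case the text and replace punctuation (hyphens, commas, dots) with spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())

def find_organization(field, canonical_names, cutoff=0.65):
    """Return the canonical name closest to some trailing part of `field`, or None.

    Assumption: the organization name is the tail of the field.
    """
    lookup = {normalize(name): name for name in canonical_names}
    keys = list(lookup)
    tokens = normalize(field).split()

    best_score, best_name = 0.0, None
    for start in range(len(tokens)):
        window = " ".join(tokens[start:])
        # get_close_matches skips candidates whose cheap upper bounds
        # already fall below the cutoff, so few full ratios are computed.
        hits = get_close_matches(window, keys, n=1, cutoff=cutoff)
        if hits:
            score = SequenceMatcher(None, hits[0], window).ratio()
            if score > best_score:
                best_score, best_name = score, lookup[hits[0]]
    return best_name

# Stand-in for the 16,000-row MySQL table (illustrative names only).
names = [
    "University of California Santa Barbara",
    "University of Canterbury",
    "Stanford University",
]

field = "Assoc. Research Professor Dept. Psychology Univ. California Santa Barbara"
print(find_organization(field, names))
```

When find_organization returns None, the record is a candidate for the "create a new organization" path described in the question. This is still a linear scan per window, so for several hundred thousand rows it may be worth caching results for repeated field values.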