Named entity recognition with a preset list of names in Python / PHP

Posted on 2024-10-02 14:22:37

I'm trying to process a CSV file in which each row has a text field containing the name of an organization and an individual's position within that organization, as unstructured text. This field is usually a mess of text like this:

Assoc. Research Professor  Dept. Psychology  Univ. California  Santa Barbara

I need to pull out the position and the organization name. For the position, I run preg_match against a series of about 60 regular expressions covering the different professions, and I think it works pretty well (my guess is that it catches about 80%). However, I'm having trouble catching the organization name. I have a MySQL table with roughly 16,000 organization names that I can run a simple preg_match against, but due to common misspellings and abbreviations, it only catches about 30% of the organizations. For example, my database has

University of California Santa Barbara

But the CSV file might have any of the options:

Univ Cal Santa Barbara
University Cal-Santa Barbara
University California-Santa Barbara
Cal University, Santa Barbara

I need to process several hundred thousand records, and I can't spend the time to correct 70% of the records that are currently not being processed correctly or painstakingly create multiple aliases for each organization. What I would like to be able to do is to catch small differences (such as the small misspellings, hyphens versus spaces, and common abbreviations), and, if still no matches are found, to ideally recognize an organizational name and create a new record for it.

  • What libraries or tools in Python or PHP would let me perform a similarity match with a broader reach?
  • Would NLTK in Python catch misspellings?
  • Is it possible to use AlchemyAPI to catch misspelled organizations? So far I've only been able to use it to catch correctly spelled organizations.
  • Since I'm comparing a short string (the organization name) to a longer string (which includes the name plus extraneous information), is there any hope in using PHP's similar_text function?

Any help or insight would be appreciated.

花之痕靓丽 2024-10-09 14:22:37

You may be able to use difflib to calculate the similarity ratio between the CSV input and the canonical spelling, and consider it a match if it's above a certain threshold (say, 0.65).

For example:

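import difflib

# Canonical spelling as stored in the database
exact = 'University of California Santa Barbara'

# Variants as they might appear in the CSV, plus one unrelated name
inputs = ['Univ Cal Santa Barbara',
          'University Cal-Santa Barbara',
          'University California-Santa Barbara',
          'Cal University, Santa Barbara',
          'Canterbury University']

sm = difflib.SequenceMatcher(None, exact)
ratios = []
for candidate in inputs:  # named candidate to avoid shadowing the built-in input()
    sm.set_seq2(candidate)
    ratios.append(sm.ratio())

print(ratios)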

gives:

[0.73333333333333328, 0.81818181818181823, 0.93150684931506844,
 0.71641791044776115, 0.33898305084745761]

Note how 'Canterbury University' gets a much lower ratio() than the inputs you gave, and falls well below the 0.65 threshold.
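To make the threshold idea concrete, here is a minimal sketch that picks the best-scoring canonical name for one input and rejects it if the score is under 0.65. The helper names normalize and best_match, the sample list of names, and the normalization rules (lowercasing, turning punctuation into spaces) are my own assumptions for illustration, not anything from your setup. It also compares whole strings, so it does not deal with the extra title and department text in your field.

```python
import difflib
import re


def normalize(text):
    # Assumption: lowercase and replace runs of non-alphanumerics
    # (hyphens, commas, periods, extra spaces) with a single space.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def best_match(text, canonical_names, threshold=0.65):
    """Return (name, ratio) for the closest canonical name, or None."""
    matcher = difflib.SequenceMatcher(None)
    # SequenceMatcher caches data about seq2, so set the shared input once.
    matcher.set_seq2(normalize(text))
    best_name, best_ratio = None, 0.0
    for name in canonical_names:
        matcher.set_seq1(normalize(name))
        ratio = matcher.ratio()
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    if best_ratio >= threshold:
        return best_name, best_ratio
    return None


# Hypothetical stand-in for the 16,000-row table
names = ['University of California Santa Barbara',
         'University of Canterbury',
         'Stanford University']

print(best_match('Univ Cal Santa Barbara', names))
print(best_match('Acme Widgets Inc', names))
```

A None result is the point where you could decide to create a new organization record instead.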

Then again, SequenceMatcher.ratio() may be too slow when every input has to be compared against 16,000 names.
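One thing that may help with speed: difflib.get_close_matches() runs the cheaper upper-bound checks real_quick_ratio() and quick_ratio() before it falls back to the full ratio(), so most non-matching names are discarded early. The sketch below is only an illustration under assumptions: closest_names and the sample list are hypothetical, and I have not measured it against a table of your size.

```python
import difflib


def closest_names(text, canonical_names, cutoff=0.65, n=3):
    """Return up to n canonical names scoring at least cutoff, best first."""
    # Map lowercased names back to their original spelling.
    lowered = {name.lower(): name for name in canonical_names}
    hits = difflib.get_close_matches(text.lower(), list(lowered),
                                     n=n, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


# Hypothetical stand-in for the 16,000-row table
names = ['University of California Santa Barbara',
         'University of Canterbury',
         'Stanford University']

print(closest_names('Univ Cal Santa Barbara', names))
print(closest_names('Acme Widgets Inc', names))
```

If this is still too slow, narrowing the candidate list first (for example, only names sharing a word with the input) would cut the number of comparisons, but that is a design choice to test on your own data.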
