Extracting company names and time periods from a string

Posted 2024-12-09 13:44:47


I am extracting information about certain companies from Reuters using Python. I have been able to get the officer/executive names, biographies, and compensation from this page.

Now, I want to extract previous position titles and companies from the biography section, which looks something like this:

Mr. Donald T. Grimes is Senior Vice President, Chief Financial Officer and Treasurer of Wolverine World Wide, Inc., since May 2008. From 2007 to 2008, he was the Executive Vice President and Chief Financial Officer for Keystone Automotive Operations, Inc., a distributor of automotive accessories and equipment. Prior to Keystone, Mr. Grimes held a series of senior corporate and divisional finance roles at Brown-Forman Corporation, a manufacturer and marketer of premium wines and spirits. During his employment at Brown-Forman, Mr. Grimes was Vice President, Director of Beverage Finance from 2006 to 2007; Vice President, Director of Corporate Planning and Analysis from 2003 to 2006; and Senior Vice President, Chief Financial Officer of Brown-Forman Spirits America from 1999 to 2003.

I can use simple regex to get the from and to years, but I am at a loss on how to write regex to get the titles and the company name as well. I know the string format is inconsistent, so I would take an answer that works for at least 70% of cases. Here's the output I would like:

2007-2008, executive vice president and chief financial officer, Keystone Automotive operations


灼疼热情 2024-12-16 13:44:47

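The problem you are trying to solve is well known and well researched. If you search for the terms "named entity extraction" and "relationship extraction", you will find a large number of research papers describing approaches and algorithms. Some good starting points are:

Chapter 7 of the book Natural Language Processing with Python; in fact, the whole book will probably help. Online chapter
This paper on "Named Entity Relation Mining Using Wikipedia"
This paper presenting a novel algorithm for relation mining, which describes mining job titles and organizations as one of its examples.

These are just a few links I found interesting. There are many more, and probably better ones, but this should help you get started.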
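As a rough illustration of what relation extraction aims for, here is a minimal hand-written sketch in Python. It is not one of the learned approaches from the references above; it only covers the single phrasing "From YYYY to YYYY, he was the TITLE for COMPANY" taken from the biography in the question. The names `TENURE` and `extract_tenures` are made up for this example.

```python
import re

# Hand-written pattern for sentences shaped like
# "From 2007 to 2008, he was the <title> for <company>, ..."
# Assumption: this phrasing is only one of many used in the biographies.
TENURE = re.compile(
    r"From (?P<start>\d{4}) to (?P<end>\d{4}), "
    r"(?:he|she) was (?:the |an? )?"
    r"(?P<title>.+?) (?:for|of|at) "
    r"(?P<company>[A-Z][\w&.-]*(?: [A-Z][\w&.-]*)*)"
)

def extract_tenures(bio):
    """Return (years, title, company) tuples for every match in bio."""
    results = []
    for m in TENURE.finditer(bio):
        years = f"{m['start']}-{m['end']}"
        # Drop trailing punctuation that the company pattern may pick up.
        results.append((years, m["title"], m["company"].rstrip(".,")))
    return results

bio = ("From 2007 to 2008, he was the Executive Vice President and Chief "
       "Financial Officer for Keystone Automotive Operations, Inc., a "
       "distributor of automotive accessories and equipment.")
print(extract_tenures(bio))
```

Sentences that put the years after the title, such as the Brown-Forman roles in the sample biography, would each need their own pattern, which is why the trained approaches in the references tend to scale better than a growing pile of rules.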


伴我心暖 2024-12-16 13:44:47


I don't think there is going to be a single regex that you can use for this, unless it's really nasty. I think the solution to this might be Natural Language Processing. Certainly there are packages for this, but using them might not be simple.

Essentially you want to take a sentence like "X is/was Y", and figure out which part is a name, which part is a list of job titles, and which parts are irrelevant. Maybe look for sequences of words that are either capitalized or small words like "and" and "of"?

(?:\u\w+)( (?:\u\w*)|(?:of)|(?:and))*  #Note the space

The \u means that the next single character (the first character of the \w+ group) is uppercase. Haven't tested it, but it seems like it should work. This may be a non-trivial problem.
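For anyone trying this in Python: the `re` module reads `\u` as the start of a Unicode escape such as `\u00e9`, not as an "uppercase" class, so the pattern above will not compile as written. Below is a possible Python adaptation, offered as a sketch rather than a tested solution. It uses `[A-Z]` in place of `\u` and moves the space so that it precedes every alternative. The names `CAP_RUN` and `capitalized_runs` are made up for this example.

```python
import re

# [A-Z] stands in for the answer's \u; the space is moved so that it
# precedes every alternative, not only the capitalized-word branch.
# The \b keeps "of"/"and" from matching the start of a longer word.
CAP_RUN = re.compile(r"[A-Z]\w*(?: (?:[A-Z]\w*|of|and)\b)*")

def capitalized_runs(text):
    """Return runs of capitalized words, optionally joined by 'of'/'and'."""
    return [m.group(0) for m in CAP_RUN.finditer(text)]

sentence = ("he was the Executive Vice President and Chief Financial "
            "Officer for Keystone Automotive Operations")
print(capitalized_runs(sentence))
```

On this sentence the title and the company come out as two separate runs, because "for" is not in the list of allowed connecting words. Commas and periods also end a run, so a title like "Vice President, Director of Beverage Finance" would be split in two, and a run can end on a dangling "and" when a lowercase word follows it.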
