Extracting company names and time periods from a string
I am extracting information about certain companies from Reuters using Python. I have been able to get the officer/executive names, biographies, and compensation from this page.
Now, I want to extract previous position titles and companies from the biography section, which looks something like this:
Mr. Donald T. Grimes is Senior Vice President, Chief Financial Officer and Treasurer of Wolverine World Wide, Inc., since May 2008. From 2007 to 2008, he was the Executive Vice President and Chief Financial Officer for Keystone Automotive Operations, Inc., a distributor of automotive accessories and equipment. Prior to Keystone, Mr. Grimes held a series of senior corporate and divisional finance roles at Brown-Forman Corporation, a manufacturer and marketer of premium wines and spirits. During his employment at Brown-Forman, Mr. Grimes was Vice President, Director of Beverage Finance from 2006 to 2007; Vice President, Director of Corporate Planning and Analysis from 2003 to 2006; and Senior Vice President, Chief Financial Officer of Brown-Forman Spirits America from 1999 to 2003.
I can use a simple regex to get the from and to years, but I am at a loss as to how to write a regex that also gets the titles and the company name. I know the string format is inconsistent, so I would take an answer that works for at least 70% of cases. Here's the output I would like:
2007-2008, executive vice president and chief financial officer, Keystone Automotive operations
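For reference, the "simple regex" for the years could look roughly like the sketch below. It only handles the "from YYYY to YYYY" wording seen in the sample biography; the names `YEAR_RANGE` and `year_ranges` are made up for illustration, and phrasings such as "since May 2008" are deliberately not covered.

```python
import re

# Matches "from 2007 to 2008" / "From 2007 to 2008" and captures both years.
YEAR_RANGE = re.compile(r"\b[Ff]rom\s+(\d{4})\s+to\s+(\d{4})\b")


def year_ranges(text):
    """Return every from/to year pair in the text as a 'start-end' string."""
    return [f"{start}-{end}" for start, end in YEAR_RANGE.findall(text)]


bio = (
    "From 2007 to 2008, he was the Executive Vice President and Chief "
    "Financial Officer for Keystone Automotive Operations, Inc. "
    "Mr. Grimes was Vice President, Director of Beverage Finance "
    "from 2006 to 2007."
)
print(year_ranges(bio))  # ['2007-2008', '2006-2007']
```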
The problem you are trying to solve is well known and well researched, and you will find a large number of research papers describing approaches and algorithms if you google the terms "Named Entity Extraction" and "Relationship Extraction". Some good starting points are:
Chapter 7 of the book "Natural Language Processing with Python". In fact, the entire book would probably be helpful. Chapter online here
This paper on "Named Entity Relation Mining using Wikipedia"
This paper, "Novel Algorithms for Relationship Mining", which describes mining job titles and organizations as one of its examples.
These are just a few links I've found interesting. There are a ton more, and probably better ones than these, but this should get you started.
I don't think there is going to be a single regex that you can use for this, unless it's really nasty. I think the solution to this might be Natural Language Processing. There are certainly packages for this, but using them might not be simple.
Essentially you want to take a sentence like "X is/was Y", and figure out which part is a name, which part is a list of job titles, and which parts are irrelevant. Maybe look for sequences of words that are either capitalized or small words like "and" and "of"?
The \u means that the next single character (the first character of the \w+ group) is uppercase. Haven't tested it, but it seems like it should work. This may be a non-trivial problem.
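A rough sketch of the capitalized-sequence idea in Python follows. Python's re module has no \u shorthand for "uppercase", so [A-Z] stands in for it here. The names `NAME_RUN` and `name_runs` are invented for this example, and dropping single-word runs is an extra assumption added to skip sentence-initial words such as "From". It does not decide which run is the title and which is the company; in this sentence the word "for" happens to separate them.

```python
import re

# One capitalized word; [A-Z] replaces \u, which Python's re does not support.
# The class also allows "&", "." and "-" so "Inc." and "Brown-Forman" stay whole.
CAP_WORD = r"\b[A-Z][\w&.-]*"
# Small lowercase words allowed inside a run, as suggested above.
CONNECTOR = r"(?:and|of)"
# A run: capitalized words separated by whitespace, an optional comma,
# and any number of connector words.
NAME_RUN = re.compile(rf"{CAP_WORD}(?:,?\s+(?:{CONNECTOR}\s+)*{CAP_WORD})*")


def name_runs(sentence):
    """Return runs of two or more words that look like titles or names."""
    runs = NAME_RUN.findall(sentence)
    # Assumption: a single capitalized word is usually just a sentence start.
    return [run for run in runs if len(run.split()) > 1]


sentence = (
    "From 2007 to 2008, he was the Executive Vice President and Chief "
    "Financial Officer for Keystone Automotive Operations, Inc., a "
    "distributor of automotive accessories and equipment."
)
for run in name_runs(sentence):
    print(run)
# Executive Vice President and Chief Financial Officer
# Keystone Automotive Operations, Inc.
```

Because the word pattern accepts a trailing ".", a run that ends a sentence will keep the final period, so some cleanup would still be needed on real biographies.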