Extracting names from between different academic degrees with regex in Python
This code fails to extract complete names from between academic degrees, for example Dr. Richard, MM or Dr. Bobby Richard Klaus, MM or Richard, MM. The academic degrees are not only Dr but also Dr., Dra., Prof., Drs, Prof. Dr., M.Ag and ME.
The desired output looks like this:
Goal result
Complete Names | Names (?) |
---|---|
Dr. RICHARD, MM | Richard |
Dra. BOBBY Richard Klaus, MM | Bobby Richard Klaus |
Richard, MM | Richard |
but the result actually looks like this:
Actual result
Complete Names | Names |
---|---|
Dr. Richard, MM | Richard |
Dra. Bobby Richard Klaus, MM | Richard Klaus |
Richard, MM | Richard, MM |
with this code:
import re

def extract_names(text):
    # fix capitalization
    text = re.sub(r"(_|-)+"," ", text).title()
    # find name between whitespace and comma
    text = re.findall("\s[A-Z]\w+(?:\s[A-Z]\w+?)?\s(?:[A-Z]\w+?)?[\s\.\,\;\:]", text)
    text = ' '.join(text[0].split(","))
    return text
There is also another problem: the code raises this error
11 text = ' '.join(text[0].split(","))
12 return text
13 # def extract_names(text):
IndexError: list index out of range
Comments (1)
You can use the patterns described below. See the regex demo.
The `ads` pattern, `(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?`, matches `Dr`, `Drs`, `Dra`, `Prof`, `M.Ag`, `ME`, `MM`, each optionally followed by a `.`.
The main pattern, `^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$`, matches:
- `^(?:\s*{ads})+\s*` - start of string, then one or more sequences of zero or more whitespaces and the `ads` pattern, and then zero or more whitespaces
- `|` - or
- `\s*,` - zero or more whitespaces and a comma
- `(?:\s*{ads})+` - one or more repetitions of zero or more whitespaces and the `ads` pattern
- `$` - end of string
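The answer's original code snippet did not survive on this page, so the following is only a minimal sketch of how the two patterns above could be combined. The names `ADS`, `PATTERN`, and `extract_name` are assumptions made for illustration, as is the choice to remove the matches with `re.sub` and to apply `.title()` afterwards (title-casing first would turn `MM` into `Mm`, which the case-sensitive pattern would no longer match).

```python
import re

# Degree/title alternatives described above, each optionally followed by a dot.
ADS = r"(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?"

# Either leading titles at the start of the string,
# or a comma followed by trailing degrees at the end of the string.
PATTERN = re.compile(rf"^(?:\s*{ADS})+\s*|\s*,(?:\s*{ADS})+$")


def extract_name(text: str) -> str:
    """Remove leading titles and trailing degrees, then fix capitalization."""
    return PATTERN.sub("", text).strip().title()


for raw in ["Dr. RICHARD, MM", "Dra. BOBBY Richard Klaus, MM", "Richard, MM"]:
    print(raw, "->", extract_name(raw))
```

Because this substitutes rather than indexing into a `re.findall` result, an input with no match is returned unchanged (apart from capitalization) instead of raising `IndexError`. One caveat of the pattern as given: it has no word boundary after the title, so a name that itself begins with one of the alternatives in the same case (for example `Drake`) would be clipped.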