Extracting names from between academic degrees using regex in Python

Posted on 2025-01-09 09:13:48

This code has trouble extracting complete names from between academic degrees, for example Dr. Richard, MM or Dr. Bobby Richard Klaus, MM or Richard, MM. The academic degrees are not only Dr. but also Dra., Prof., Drs, Prof. Dr., M.Ag and ME.

The output should look like this

Goal Result

Complete Names	Names (?)
Dr. RICHARD, MM	Richard
Dra. BOBBY Richard Klaus, MM	Bobby Richard Klaus
Richard, MM	Richard

But the actual result looks like this

Actual Result

Complete Names	Names
Dr. Richard, MM	Richard
Dra. Bobby Richard Klaus, MM	Richard Klaus
Richard, MM	Richard, MM

with this code

import re

def extract_names(text):
   # Fix capitalization
   text = re.sub(r"(_|-)+", " ", text).title()
   # Find the name between whitespace and a comma
   text = re.findall(r"\s[A-Z]\w+(?:\s[A-Z]\w+?)?\s(?:[A-Z]\w+?)?[\s\.\,\;\:]", text)
   text = ' '.join(text[0].split(","))
   return text

Then there is another problem, this error:

11 text = ' '.join(text[0].split(","))
12 return text
13 # def extract_names(text):

IndexError: list index out of range
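
For readers who want to reproduce the error in isolation: re.findall returns an empty list when nothing matches, and indexing that list with [0] is what raises IndexError. The sketch below assumes the input "Richard, MM" from the table above; the names PATTERN, matches and first are illustrative only and are not part of the original post.

```python
import re

# The pattern from the question, written as a raw string.
PATTERN = r"\s[A-Z]\w+(?:\s[A-Z]\w+?)?\s(?:[A-Z]\w+?)?[\s\.\,\;\:]"

# .title() turns "Richard, MM" into "Richard, Mm", as in extract_names.
matches = re.findall(PATTERN, "Richard, MM".title())
print(matches)  # [] because no part of "Richard, Mm" satisfies the pattern

# Indexing an empty list raises IndexError, so check before taking [0].
first = matches[0] if matches else None
print(first)  # None
```

A guard like this only avoids the exception; it does not make the pattern match more names.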

流星番茄 2025-01-16 09:13:48

You can use

import re

ads = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'
result = re.sub(fr'^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$', '', text, flags=re.I)

See the regex demo.

The (?:Dr[sa]?|Prof|M\.Ag|M[EM])\.? pattern matches Dr, Drs, Dra, Prof, M.Ag, ME or MM, optionally followed by a dot (.).

The ^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$ main pattern matches

^(?:\s*{ads})+\s* - start of string, then one or more sequences of zero or more whitespaces and the ads pattern, and then zero or more whitespaces
| - or
\s*, - zero or more whitespaces and a comma
(?:\s*{ads})+ - one or more repetitions of zero or more whitespaces and the ads pattern
$ - end of string
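
To see the pattern at work on the sample rows from the question, here is a minimal sketch. The helper names build_pattern and extract_name are hypothetical, and the .title() call is borrowed from the question's own code to normalise capitalisation. ADS_STRICT is an assumption that is not part of the answer: it adds a word boundary after each degree token, because without it a name that merely starts with a degree's letters (for example "Drake") loses its first characters.

```python
import re

# Degree tokens copied from the answer above, with an optional trailing dot.
ADS = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\.?'
# Assumption (not in the answer): a word boundary after each token, so that
# names such as "Drake" or "Meredith" are not mistaken for "Dra" or "ME".
ADS_STRICT = r'(?:Dr[sa]?|Prof|M\.Ag|M[EM])\b\.?'


def build_pattern(ads):
    """Compile the answer's main pattern around the given degree sub-pattern."""
    return re.compile(fr'^(?:\s*{ads})+\s*|\s*,(?:\s*{ads})+$', flags=re.I)


def extract_name(text, pattern=build_pattern(ADS)):
    """Remove leading and trailing degrees, then normalise capitalisation."""
    return pattern.sub('', text).title()


for sample in ("Dr. RICHARD, MM", "Dra. BOBBY Richard Klaus, MM", "Richard, MM"):
    print(extract_name(sample))
# prints Richard, Bobby Richard Klaus, Richard (one per line)

print(extract_name("Drake Smith, MM"))                             # Ke Smith
print(extract_name("Drake Smith, MM", build_pattern(ADS_STRICT)))  # Drake Smith
```

Because re.sub returns the input unchanged when nothing matches, this approach never indexes into a possibly empty list, so the IndexError from the question cannot occur here.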