如何从一段或一堆段落中查找标题短语

发布于 2024-08-03 02:34:07 字数 378 浏览 10 评论 0原文

如何从段落中解析句子大小写短语。

例如，

柯南·道尔在这段话中说，福尔摩斯这个人物的灵感来自于约瑟夫·贝尔医生，道尔曾在爱丁堡皇家医院担任职员。和福尔摩斯一样，贝尔以从最小的观察中得出大的结论而闻名。[1]迈克尔·哈里森 (Michael Harrison) 在 1971 年埃勒里·奎恩 (Ellery Queen) 的悬疑杂志上发表的一篇文章中指出，该角色的灵感来自温德尔·谢勒 (Wendell Scherer)，他是一名谋杀案的“咨询侦探”，据称该案于 1882 年在英国受到了报纸的广泛关注。

我们需要生成诸如Conan Doyle、Holmes、Dr Joseph Bell、Wendell Scherr 等。

如果可能的话，我更喜欢 Pythonic 解决方案

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

如痴如狂 2024-08-10 02:34:07

这种处理可能非常棘手。这个简单的代码几乎做了正确的事情：

for s in re.finditer(r"([A-Z][a-z]+[. ]+)+([A-Z][a-z]+)?", text):
    print s.group(0)

产生：

Conan Doyle
Holmes
Dr. Joseph Bell
Doyle
Edinburgh Royal Infirmary. Like Holmes
Bell
Michael Harrison
Ellery Queen
Mystery Magazine
Wendell Scherer
England

要包含“Dr. Joseph Bell”，您需要接受字符串中的句点，这允许“Edinburgh Royal Infirmary. Like Holmes”。

我遇到了类似的问题：分隔句子。

This kind of processing can be very tricky. This simple code does almost the right thing:

for s in re.finditer(r"([A-Z][a-z]+[. ]+)+([A-Z][a-z]+)?", text):
    print s.group(0)

produces:

Conan Doyle
Holmes
Dr. Joseph Bell
Doyle
Edinburgh Royal Infirmary. Like Holmes
Bell
Michael Harrison
Ellery Queen
Mystery Magazine
Wendell Scherer
England

To include "Dr. Joseph Bell", you need to be ok with the period in the string, which allows in "Edinburgh Royal Infirmary. Like Holmes".

I had a similar problem: Separating Sentences.

回复收藏 0 原文

貪欢 2024-08-10 02:34:07

“重新”方法很快就会失去动力。命名实体识别是一个非常复杂的主题，远远超出了 SO 答案的范围。如果你认为你有解决这个问题的好方法，请指出 Flann O'Brien aka Myles na cGopaleen、Sukarno、Harry S. Truman、J. Edgar Hoover、JK Rowling、数学家 L'Hopital、Joe di Maggio、阿尔杰农·道格拉斯·蒙塔古·斯科特和雨果·马克斯·格拉夫·冯·冯·祖·勒兴菲尔德·科弗林·勋伯格。

更新以下是基于“re”的方法，可以找到更多有效案例。但我仍然不认为这是一个好的方法。注意：我已经在我的文本样本中将巴伐利亚伯爵的名字关联起来。如果有人真的想使用这样的东西，他们应该使用 Unicode，并在某个阶段（输入或输出时）标准化空格。

import re

text1 = """Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882."""

text2 = """Flann O'Brien a.k.a. Myles na cGopaleen, I Zingari, Sukarno and Suharto, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg."""

pattern1 = r"(?:[A-Z][a-z]+[. ]+)+(?:[A-Z][a-z]+)?"

joiners = r"' - de la du von und zu auf van der na di il el bin binte abu etcetera".split()

pattern2 = r"""(?x)
    (?:
        (?:[ .]|\b%s\b)*
        (?:\b[a-z]*[A-Z][a-z]*\b)?
    )+
    """ % r'\b|\b'.join(joiners)

def get_names(pattern, text):
    for m in re.finditer(pattern, text):
        s = m.group(0).strip(" .'-")
        if s:
            yield s

for t in (text1, text2):
    print "*** text: ", t[:20], "..."
    print "=== Ned B"
    for s in re.finditer(pattern1):
        print repr(s.group(0))
    print "=== John M =="
    for name in get_names(pattern2, t):
        print repr(name)

输出：

C:\junk\so>\python26\python extract_names.py
*** text:  Conan Doyle said tha ...
=== Ned B
'Conan Doyle '
'Holmes '
'Dr. Joseph Bell'
'Doyle '
'Edinburgh Royal Infirmary. Like Holmes'
'Bell '
'Michael Harrison '
'Ellery Queen'
'Mystery Magazine '
'Wendell Scherer'
'England '
=== John M ==
'Conan Doyle'
'Holmes'
'Dr. Joseph Bell'
'Doyle'
'Edinburgh Royal Infirmary. Like Holmes'
'Bell'
'Michael Harrison'
'Ellery Queen'
'Mystery Magazine'
'Wendell Scherer'
'England'
*** text:  Flann O'Brien a.k.a. ...
=== Ned B
'Flann '
'Brien '
'Myles '
'Sukarno '
'Harry '
'Edgar Hoover'
'Joe '
'Algernon Douglas'
'Hugo Max Graf '
'Lerchenfeld '
'Koefering '
'Schoenberg.'
=== John M ==
"Flann O'Brien"
'Myles na cGopaleen'
'I Zingari'
'Sukarno'
'Suharto'
'Harry S. Truman'
'J. Edgar Hoover'
'J. K. Rowling'
"L'Hopital"
'Joe di Maggio'
'Algernon Douglas-Montagu-Scott'
'Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg'

The "re" approach runs out of steam very quickly. Named entity recognition is a very complicated topic, way beyond the scope of an SO answer. If you think you have a good approach to this problem, please point it at Flann O'Brien a.k.a. Myles na cGopaleen, Sukarno, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Köfering und Schönberg.

Update Following is an "re"-based approach that finds a lot more valid cases. I still don't think that this is a good approach, though. N.B. I've asciified the Bavarian count's name in my text sample. If anyone really wants to use something like this, they should work in Unicode, and normalise whitespace at some stage (either on input or on output).

import re

text1 = """Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882."""

text2 = """Flann O'Brien a.k.a. Myles na cGopaleen, I Zingari, Sukarno and Suharto, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg."""

pattern1 = r"(?:[A-Z][a-z]+[. ]+)+(?:[A-Z][a-z]+)?"

joiners = r"' - de la du von und zu auf van der na di il el bin binte abu etcetera".split()

pattern2 = r"""(?x)
    (?:
        (?:[ .]|\b%s\b)*
        (?:\b[a-z]*[A-Z][a-z]*\b)?
    )+
    """ % r'\b|\b'.join(joiners)

def get_names(pattern, text):
    for m in re.finditer(pattern, text):
        s = m.group(0).strip(" .'-")
        if s:
            yield s

for t in (text1, text2):
    print "*** text: ", t[:20], "..."
    print "=== Ned B"
    for s in re.finditer(pattern1):
        print repr(s.group(0))
    print "=== John M =="
    for name in get_names(pattern2, t):
        print repr(name)

Output:

C:\junk\so>\python26\python extract_names.py
*** text:  Conan Doyle said tha ...
=== Ned B
'Conan Doyle '
'Holmes '
'Dr. Joseph Bell'
'Doyle '
'Edinburgh Royal Infirmary. Like Holmes'
'Bell '
'Michael Harrison '
'Ellery Queen'
'Mystery Magazine '
'Wendell Scherer'
'England '
=== John M ==
'Conan Doyle'
'Holmes'
'Dr. Joseph Bell'
'Doyle'
'Edinburgh Royal Infirmary. Like Holmes'
'Bell'
'Michael Harrison'
'Ellery Queen'
'Mystery Magazine'
'Wendell Scherer'
'England'
*** text:  Flann O'Brien a.k.a. ...
=== Ned B
'Flann '
'Brien '
'Myles '
'Sukarno '
'Harry '
'Edgar Hoover'
'Joe '
'Algernon Douglas'
'Hugo Max Graf '
'Lerchenfeld '
'Koefering '
'Schoenberg.'
=== John M ==
"Flann O'Brien"
'Myles na cGopaleen'
'I Zingari'
'Sukarno'
'Suharto'
'Harry S. Truman'
'J. Edgar Hoover'
'J. K. Rowling'
"L'Hopital"
'Joe di Maggio'
'Algernon Douglas-Montagu-Scott'
'Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg'

回复收藏 0 原文

~没有更多了~