How to find title-case phrases in a paragraph or a set of paragraphs
How do I extract title-case phrases (runs of capitalized words) from a passage?
For example, from this passage:
Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882.
We need to extract phrases such as Conan Doyle, Holmes, Dr. Joseph Bell, Wendell Scherer, etc.
I would prefer a Pythonic solution if possible.
Comments (2)
This kind of processing can be very tricky. This simple code does almost the right thing:
produces:
To include "Dr. Joseph Bell", you have to allow a period inside a match, and that also lets through "Edinburgh Royal Infirmary. Like Holmes".
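To illustrate that trade-off, here is a minimal sketch. It is not the code from this answer; the two patterns (`STRICT` and `LOOSE`) are assumptions made up for the example, and they only handle plain ASCII words.

```python
import re

TEXT = (
    "Conan Doyle said that the character of Holmes was inspired by "
    "Dr. Joseph Bell, for whom Doyle had worked as a clerk at the "
    "Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing "
    "large conclusions from the smallest observations."
)

# A capitalized word, followed by any number of further capitalized
# words, each separated by a single space.
STRICT = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

# Same idea, but each word may end in a period, so that
# "Dr. Joseph Bell" survives as one phrase.
LOOSE = re.compile(r"[A-Z][a-z]+\.?(?: [A-Z][a-z]+\.?)*")

strict_hits = STRICT.findall(TEXT)
loose_hits = LOOSE.findall(TEXT)

print(strict_hits)
# ['Conan Doyle', 'Holmes', 'Dr', 'Joseph Bell', 'Doyle',
#  'Edinburgh Royal Infirmary', 'Like Holmes', 'Bell']
print(loose_hits)
# ['Conan Doyle', 'Holmes', 'Dr. Joseph Bell', 'Doyle',
#  'Edinburgh Royal Infirmary. Like Holmes', 'Bell']
```

The strict pattern splits "Dr" from "Joseph Bell"; the loose one keeps the title but glues the end of one sentence onto the start of the next. Both also pick up "Like Holmes", because a sentence-initial word looks exactly like part of a name.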
I had a similar problem: Separating Sentences.
The "re" approach runs out of steam very quickly. Named entity recognition is a very complicated topic, way beyond the scope of an SO answer. If you think you have a good approach to this problem, please point it at Flann O'Brien a.k.a. Myles na cGopaleen, Sukarno, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Köfering und Schönberg.
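As a rough illustration of how quickly a pattern breaks down, the sketch below runs a naive capitalized-word regex over some of those names. The pattern `NAIVE` and the helper `found_whole` are hypothetical, chosen only to show the failure modes (initials, apostrophes, hyphens, lowercase particles).

```python
import re

# Hypothetical naive pattern: runs of capitalized ASCII words.
NAIVE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

NAMES = [
    "Flann O'Brien",
    "Myles na cGopaleen",
    "Sukarno",
    "Harry S. Truman",
    "J. Edgar Hoover",
    "Joe di Maggio",
    "Algernon Douglas-Montagu-Scott",
]


def found_whole(name):
    """Return True if the naive pattern yields the name as one intact match."""
    return NAIVE.findall(name) == [name]


survivors = [name for name in NAMES if found_whole(name)]
print(survivors)  # ['Sukarno']
```

Only the single-word name comes through intact; every other name is split or truncated.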
Update: The following is an "re"-based approach that finds a lot more valid cases. I still don't think that this is a good approach, though. N.B. I've asciified the Bavarian count's name in my text sample. If anyone really wants to use something like this, they should work in Unicode, and normalise whitespace at some stage (either on input or on output).
Output:
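The advice about working in Unicode and normalising whitespace could look something like the sketch below. This is an assumption-laden illustration rather than the code from this answer: the helpers `normalise` and `capitalised_runs` are made-up names, and the grouping rule (adjacent capitalized words separated by exactly one space) is deliberately simplistic.

```python
import re
import unicodedata

# One run of Unicode letters: \w minus digits and underscore.
LETTERS = re.compile(r"[^\W\d_]+")


def normalise(text):
    """Compose characters (NFC) and collapse each whitespace run to one space."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def capitalised_runs(text):
    """Group adjacent capitalized words that are separated by a single space."""
    clean = normalise(text)
    runs, current, prev_end = [], [], None
    for match in LETTERS.finditer(clean):
        word = match.group()
        adjacent = prev_end is not None and clean[prev_end:match.start()] == " "
        if word[0].isupper():
            if current and not adjacent:
                # Punctuation intervened, so start a new phrase.
                runs.append(" ".join(current))
                current = []
            current.append(word)
        else:
            if current:
                runs.append(" ".join(current))
            current = []
        prev_end = match.end()
    if current:
        runs.append(" ".join(current))
    return runs


print(capitalised_runs("inspired by Wendell\n   Scherer, a detective"))
# ['Wendell Scherer']
print(capitalised_runs("auf  Ko\u0308fering\nund   Scho\u0308nberg"))
# ['Köfering', 'Schönberg']
```

Normalising to NFC first matters because a decomposed "ö" (an "o" followed by a combining diaeresis) would otherwise split the word in two, since the combining mark is not matched as a letter.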