Using NLTK to convert Early Modern English to 20th-century spelling
I have a list of strings that are all early modern English words ending with 'th.' These include hath, appointeth, demandeth, etc. -- they are all conjugated for the third person singular.
As part of a much larger project (using my computer to convert the Gutenberg etext of Gargantua and Pantagruel into something more like 20th-century English, so that I'll be able to read it more easily), I want to remove the last two or three characters from all of those words and replace them with an 's', then use a slightly modified function on the words that still weren't modernized. Both functions are included below.
My main problem is that I just never manage to get my typing right in Python. I find that part of the language really confusing at this point.
Here's the function that removes th's:
from __future__ import division
import nltk, re, pprint

def ethrema(word):
    if word.endswith('th'):
        return word[:-2] + 's'
Here's the function that removes extraneous e's:
def ethremb(word):
    if word.endswith('es'):
        return word[:-2] + 's'
Hence the words 'abateth' and 'accuseth' would pass through ethrema but not through ethremb(ethrema), while the word 'abhorreth' would need to pass through both.
If anyone can think of a more efficient way to do this, I'm all ears.
Here's the result of my very amateurish attempt to use these functions on a tokenized list of words that need modernizing:
>>> eth1 = [w.ethrema() for w in text]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'ethrema'
So, yeah, it's really an issue of typing. These are the first functions I've ever written in Python, and I have no idea how to apply them to actual objects.
Comments (1)
ethrema() is not a method of the type str; you have to call it as a plain function and pass the word as its argument.

EDIT (to answer comment):

ethremb(ethrema(word)) wouldn't work until you made some small changes to your functions: as written, each one returns None when the word doesn't end with the suffix it looks for, so both need to return the word unchanged in that case.
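The code that originally accompanied this answer did not survive on this page, so here is a minimal sketch of both points rather than the answerer's exact code. The sample list `text` is a hypothetical stand-in for the tokenized word list from the question, and the added `return word` lines are one way of making the "small changes" mentioned above.

```python
def ethrema(word):
    """Replace a trailing 'th' with 's', e.g. 'abateth' -> 'abates'."""
    if word.endswith('th'):
        return word[:-2] + 's'
    # Return the word unchanged instead of falling through to None.
    return word


def ethremb(word):
    """Replace a trailing 'es' with 's', e.g. 'abhorres' -> 'abhorrs'."""
    if word.endswith('es'):
        return word[:-2] + 's'
    # Same change: always hand back a string.
    return word


# Hypothetical sample; in the question, `text` is the tokenized word list.
text = ['abateth', 'abhorreth', 'hath']

# Call the function with the word as its argument, not as a str method.
eth1 = [ethrema(w) for w in text]
print(eth1)

# Chaining works now because ethrema always returns a string.
print(ethremb(ethrema('abhorreth')))
```

Note that running ethremb over every result would also turn 'abates' into 'abats', so deciding which words need the second pass is still a separate step.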