Using NLTK to convert Early Modern English to 20th-century spelling

Posted on 2024-09-16 06:02:44

I have a list of strings that are all Early Modern English words ending with 'th'. These include hath, appointeth, demandeth, etc.; they are all conjugated for the third person singular.

As part of a much larger project (using my computer to convert the Gutenberg etext of Gargantua and Pantagruel into something more like 20th-century English, so that I'll be able to read it more easily), I want to remove the last two or three characters from each of those words and replace them with an 's', then run a slightly modified function on the words that still aren't modernized. Both functions are included below.

My main problem is that I just never manage to get my typing right in Python. I find that part of the language really confusing at this point.

Here's the function that removes th's:

from __future__ import division
import nltk, re, pprint

def ethrema(word):
    if word.endswith('th'):
        return word[:-2] + 's'

Here's the function that removes extraneous e's:

def ethremb(word):
    if word.endswith('es'):
        return word[:-2] + 's'

Hence the words 'abateth' and 'accuseth' would pass through ethrema but not through ethremb(ethrema(word)), while the word 'abhorreth' would need to pass through both.

If anyone can think of a more efficient way to do this, I'm all ears.

Here's the result of my very amateurish attempt to use these functions on a tokenized list of words that need modernizing:

>>> eth1 = [w.ethrema() for w in text]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'ethrema'

So, yeah, it's really an issue of typing. These are the first functions I've ever written in Python, and I have no idea how to apply them to actual objects.

╭⌒浅淡时光〆 2024-09-23 06:02:44

ethrema() is not a method of the str type. It is a standalone function, so you pass each word to it as an argument:

eth1 = [ethrema(w) for w in text]
#AND
eth2 = [ethremb(w) for w in text]
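
As a minimal sketch of the call style above, here is the question's ethrema exactly as written, applied to a short list. The three sample tokens are an assumption for illustration and are not taken from the actual text.

```python
# The question's function, unchanged: there is no return statement
# for words that do not end in 'th'.
def ethrema(word):
    if word.endswith('th'):
        return word[:-2] + 's'

# Hypothetical sample tokens (an assumption, not from the real text).
text = ['hath', 'appointeth', 'the']

# Call the function with each word as its argument.
eth1 = [ethrema(w) for w in text]
print(eth1)  # ['has', 'appointes', None]
```

Note that a word that does not end in 'th' comes back as None, because a Python function that finishes without reaching a return statement returns None.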

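EDIT (to answer a comment):

ethremb(ethrema(word)) won't work until you make a small change to your functions. As written, each one returns None for a word that doesn't have the matching suffix, so the outer call fails on None. Add an else branch that returns the word unchanged: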

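def ethrema(word):
    # Replace a trailing 'th' with 's'; leave other words unchanged.
    if word.endswith('th'):
        return word[:-2] + 's'
    else:
        return word

def ethremb(word):
    # Drop the 'e' from a trailing 'es'; leave other words unchanged.
    if word.endswith('es'):
        return word[:-2] + 's'
    else:
        return word

#OR

def ethrema(word):
    # Handle both suffixes in one function; only the first match applies.
    if word.endswith('th'):
        return word[:-2] + 's'
    elif word.endswith('es'):
        return word[:-2] + 's'
    else:
        return word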
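
The question says that only some words (such as 'abhorreth') should go through both functions. One way to sketch that, assuming you keep a list of the words that need the second pass, is shown below. The names `needs_second_pass` and `modernize` are hypothetical and introduced only for illustration.

```python
def ethrema(word):
    """Replace a trailing 'th' with 's'; return other words unchanged."""
    if word.endswith('th'):
        return word[:-2] + 's'
    return word

def ethremb(word):
    """Drop the 'e' from a trailing 'es'; return other words unchanged."""
    if word.endswith('es'):
        return word[:-2] + 's'
    return word

# Hypothetical set of words whose modern form also loses the 'e'.
needs_second_pass = {'abhorreth'}

def modernize(word):
    """Apply ethrema to every word and ethremb only to the listed words."""
    result = ethrema(word)
    if word in needs_second_pass:
        result = ethremb(result)
    return result

words = ['abateth', 'accuseth', 'abhorreth']
modern = [modernize(w) for w in words]
print(modern)  # ['abates', 'accuses', 'abhorrs']
```

Applying ethremb to every word would turn 'abates' into 'abats', which is why the second pass is restricted here. The doubled consonant in 'abhorrs' is left as is; the two functions from the question do not handle it.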
