Grammar, tagging, stemming and word-sense disambiguation NLP stuff in Python

Published 2024-12-21 21:23:04


Background (TLDR; provided for the sake of completeness)

Seeking advice on an optimal solution to an odd requirement.
I'm a (literature) student in my fourth year of college with only my own guidance in programming. I'm competent enough with Python that I won't have trouble implementing solutions I find (most of the time) and developing upon them, but because of my newbness, I'm seeking advice on the best ways I might tackle this peculiar problem.

Already using NLTK, but differently from the examples in the NLTK book. I'm already utilizing a lot of stuff from NLTK, particularly WordNet, so that material is not foreign to me. I've read most of the NLTK book.

I'm working with fragmentary, atomic language. Users input words and sentence fragments, and WordNet is used to find connections between the inputs, and generate new words and sentences/fragments. My question is about turning an uninflected word from WordNet (a synset) into something that makes sense contextually.

The problem: How to inflect the result in a grammatically sensible way? Without any kind of grammatical processing, the results are just a list of dictionary-searchable words, without agreement between words. First step is for my application to stem/pluralize/conjugate/inflect root-words according to context. (The "root words" I'm speaking of are synsets from WordNet and/or their human-readable equivalents.)

Example scenario

Let's assume we have a chunk of a poem, to which users are adding new inputs. The new results need to be inflected in a grammatically sensible way.

The river bears no empty bottles, sandwich papers,   
Silk handkerchiefs, cardboard boxes, cigarette ends  
Or other testimony of summer nights. The sprites

Let's say now, it needs to print 1 of 4 possible next words/synsets: ['departure', 'to have', 'blue', 'quick']. It seems to me that 'blue' should be discarded; 'The sprites blue' seems grammatically odd/unlikely. From there it could use any of the remaining candidates.

If it picks 'to have' the result could be sensibly inflected as 'had', 'have', 'having', 'will have', 'would have', etc. (but not 'has'). (The resulting line would be something like 'The sprites have' and the sensibly-inflected result will provide better context for future results ...)

I'd like for 'departure' to be a valid possibility in this case; while 'The sprites departure' doesn't make sense (it's not "sprites'"), 'The sprites departed' (or other verb conjugations) would.

Seemingly 'The sprites quick' wouldn't make sense, but something like 'The sprites quickly [...]' or 'The sprites quicken' could, so 'quick' is also a possibility for sensible inflection.
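The rule-out in this example can be sketched with a hand-coded table of which parts of speech may follow a plural noun. The POS labels, the transition table, and the derivation table below are illustrative assumptions, not the output of any real tagger:

```python
# Toy filter: which candidates may follow a plural noun ("The sprites")?
# The POS labels and tables are hand-made assumptions for illustration;
# a real system would learn them from a tagged corpus.

CANDIDATE_POS = {
    "departure": "NOUN",   # but has the verb root 'depart'
    "to have":   "VERB",
    "blue":      "ADJ",
    "quick":     "ADJ",    # but has derived forms 'quickly' / 'quicken'
}

# After a plural noun, assume a verb or adverb is plausible; a bare
# adjective or another determiner is not.
MAY_FOLLOW_PLURAL_NOUN = {"VERB", "ADV"}

# Some words can be re-derived into an allowed part of speech.
DERIVABLE = {
    "departure": "depart",   # noun -> verb root
    "quick":     "quickly",  # adjective -> adverb
}

def keep(word):
    """Keep a candidate if its POS fits, or if it can be derived to fit."""
    return CANDIDATE_POS[word] in MAY_FOLLOW_PLURAL_NOUN or word in DERIVABLE

kept = [w for w in CANDIDATE_POS if keep(w)]
print(kept)  # 'blue' is the only candidate with no usable derivation
```

Here 'departure' and 'quick' survive only because of the derivation table, which matches the intuition in the example above.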

Breaking down the tasks

  1. Tag part of speech, plurality, tense, etc. -- of original inputs. Taking note of this could help to select from the several possibilities (i.e. choosing between had/have/having could be more directed than random if a user had inputted 'having' rather than some other tense). I've heard the Stanford POS tagger is good, which has an implementation in NLTK. I am not sure how to handle tense detection here.
  2. Consider context in order to rule out grammatically peculiar possibilities. Consider the last couple words and their part-of-speech tags (and tense?), as well as sentence boundaries if any, and from that, drop things that wouldn't make sense. After 'The sprites' we don't want another article (or determiner, as far as I can tell), nor an adjective, but an adverb or verb could work. Comparison of the current stuff with sequences in tagged corpora (and/or Markov chains?) -- or consultation of grammar-checking functions -- could provide a solution for this.
  3. Select a word from the remaining possibilities (those that could be inflected sensibly). This isn't something I need an answer for -- I've got my methods for this. Let's say it's randomly selected.
  4. Transform the selected word as needed. If the information from #1 can be folded in (for example, perhaps the "pluralize" flag was set to True), do so. If there are several possibilities (e.g. the picked word is a verb, but a few tenses are possible), select randomly. Regardless, I'm going to need to morph/inflect the word.
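Step 4 could be sketched roughly as below. NLTK itself doesn't generate inflections, so this uses a hand-entered irregular paradigm plus naive regular suffix rules; in practice a library such as `pattern.en` or `lemminflect` would replace the toy rules. The paradigm table and tense-hint handling are assumptions for illustration:

```python
import random

# Hand-entered irregular paradigm (illustrative); 'has' is excluded
# because the subject ("The sprites") is plural.
IRREGULAR_PLURAL_SUBJECT = {
    "have": ["had", "have", "having", "will have", "would have"],
}

def inflections(verb):
    """Possible inflections of a verb under a plural subject."""
    if verb in IRREGULAR_PLURAL_SUBJECT:
        return IRREGULAR_PLURAL_SUBJECT[verb]
    # Naive regular rules: fine for 'depart', wrong for e.g. 'stop'.
    return [verb + "ed", verb + "ing", verb, "will " + verb]

def transform(verb, preferred_tense=None):
    """Step 4: pick an inflection, honoring a tense hint from step 1."""
    forms = inflections(verb)
    if preferred_tense == "gerund":  # e.g. the user had typed an -ing form
        ing_forms = [f for f in forms if f.endswith("ing")]
        if ing_forms:
            return ing_forms[0]
    return random.choice(forms)      # otherwise pick randomly

print(transform("depart", preferred_tense="gerund"))  # -> 'departing'
```

The tense hint from step 1 makes the choice "more directed than random", as described above, while still falling back to a random pick when no hint applies.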

I'm looking for advice on the soundness of this routine, as well as suggestions for steps to add. Ways to break down these steps further would also be helpful. Finally I'm looking for suggestions on what tool might best accomplish each task.


Comments (1)

日裸衫吸 2024-12-28 21:23:04


I think that the comment above about an n-gram language model fits your requirements better than parsing and tagging. Parsers and taggers (unless modified) will suffer from the lack of right context of the target word (i.e., you don't have the rest of the sentence available at time of query). On the other hand, language models consider the past (left context) efficiently, especially for windows up to 5 words. The problem with n-grams is that they don't model long-distance dependencies (more than n words).

NLTK has a language model: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.model.ngram-pysrc.html . A tag lexicon may help you smooth the model more.

The steps as I see them: 1. Get a set of words from the users. 2. Create a larger set of all possible inflections of the words. 3. Ask the model which inflected word is most probable.
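The three steps could be sketched with a plain maximum-likelihood bigram model. The toy corpus and counts below are stand-ins for real training data (and note the linked `nltk.model` module is from an old NLTK release; current NLTK ships `nltk.lm` instead, which also provides the smoothing mentioned above):

```python
from collections import Counter

# Toy corpus standing in for real training text; a real model would be
# trained on a large corpus and smoothed (e.g. with a tag lexicon).
corpus = (
    "the sprites departed quickly . "
    "the rivers have no bottles . "
    "the sprites have gone . "
    "empty bottles floated away ."
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def score(prev, word):
    """Maximum-likelihood P(word | prev); zero if the bigram is unseen."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# Step 1: a word from the user; step 2: all its inflections;
# step 3: ask the model which inflected form is most probable here.
inflected = ["departure", "departed", "departing", "departs"]
best = max(inflected, key=lambda w: score("sprites", w))
print(best)  # -> 'departed'
```

With unsmoothed counts, any inflection unseen after 'sprites' scores zero, which is exactly where the suggested smoothing would matter.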
