Selecting the most fluent text from a set of possibilities via grammar checking (Python)
Some background
I am a literature student at New College of Florida, currently working on an overly ambitious creative project. The project is geared towards the algorithmic generation of poetry. It's written in Python. My Python knowledge and Natural Language Processing knowledge come only from teaching myself things through the internet. I've been working with this stuff for about a year, so I'm not helpless, but at various points I've had trouble moving forward in this project. Currently, I am entering the final phases of development, and have hit a little roadblock.
I need to implement some form of grammatical normalization, so that the output doesn't come out as unconjugated, uninflected caveman-speak. About a month ago some friendly folks on SO gave me some advice on how I might solve this issue, basically by using an ngram language modeller, but I'm looking for other solutions, as it seems that NLTK's NgramModeler is not fit for my needs. (The possibility of POS tagging was also mentioned, but given my amateur-ness, my text may be too fragmentary and strange for such an implementation to come easily.)
Perhaps I need something like AtD, but hopefully less complex
I think I need something that works like After the Deadline or Queequeg, but neither of these seems exactly right. Queequeg is probably not a good fit: it was written in 2003 for Unix and I can't get it working on Windows for the life of me (I have tried everything). But I like that all it checks for is proper verb conjugation and number agreement.
On the other hand, AtD is much more rigorous, offering more capabilities than I need. But I can't seem to get the Python bindings for it working. (I get 502 errors from the AtD server, which I'm sure are easy to fix, but my application is going to be online, and I'd rather avoid depending on another server. I can't afford to run an AtD server myself, because the number of "services" my application is going to require of my web host is already threatening to cause problems in getting this application hosted cheaply.)
Things I'd like to avoid
Building ngram language models myself doesn't seem right for the task. My application throws up a lot of unknown vocabulary, skewing all the results. (Unless I use a corpus that's so large that it runs way too slowly for my application; the application needs to be pretty snappy.)
Strictly checking grammar isn't right for the task either. The grammar doesn't need to be perfect, and the sentences don't have to be any more sensible than the kind of English-like gibberish that you can generate using ngrams. Even if it's gibberish, I just need to enforce verb conjugation and number agreement, and do things like remove extra articles.
In fact, I don't even need any kind of suggestions for corrections. I think all I need is for something to tally up how many errors seem to occur in each sentence in a group of possible sentences, so I can sort by their score and pick the one with the fewest grammatical issues.
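To make the "tally and sort" idea concrete, here is a minimal sketch of one such obvious-error detector: counting doubled articles. The function names (count_extra_articles, rank_by_errors) and the regular expression are illustrative assumptions, not part of any library mentioned above, and a real scorer would presumably combine several detectors like this one.

```python
import re

# Assumption: an "extra article" is any article immediately followed by
# another article ("the the", "a the", ...). This is a deliberately crude rule.
_DOUBLED_ARTICLE = re.compile(r"\b(?:a|an|the)\s+(?=(?:a|an|the)\b)", re.IGNORECASE)


def count_extra_articles(sentence):
    """Count articles that are directly followed by another article."""
    return len(_DOUBLED_ARTICLE.findall(sentence))


def rank_by_errors(sentences, count_errors=count_extra_articles):
    """Return the sentences sorted from fewest to most detected errors."""
    return sorted(sentences, key=count_errors)


candidates = ["The the old river bears", "The old river bears"]
ranked = rank_by_errors(candidates)  # the sentence without "The the" comes first
```

Because sorted() is stable, sentences with the same error count keep their original relative order, so a separate tie-breaking step can still be applied afterwards.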
A simple solution? Scoring fluency by detecting obvious errors
If a script exists that takes care of all this, I'd be overjoyed (I haven't found one yet). I can write code for what I can't find, of course; I'm looking for advice on how to optimize my approach.
Let's say we have a tiny bit of text already laid out:
existing_text = "The old river"
Now let's say my script needs to figure out which inflection of the verb "to bear" could come next. I'm open to suggestions about this routine. But I need help mostly with step #2, rating fluency by tallying grammatical errors:
1. Use the verb conjugation methods in NodeBox Linguistics to come up with all conjugations of this verb:
['bear', 'bears', 'bearing', 'bore', 'borne']
2. Iterate over the possibilities, (shallowly) checking the grammar of the string resulting from
existing_text + " " + possibility
("The old river bear", "The old river bears", etc.). Tally the error count for each construction. In this case the only construction to raise an error, seemingly, would be "The old river bear".
3. Wrapping up should be easy: of the possibilities with the lowest error count, select randomly.
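The routine described above could be sketched roughly as follows. The conjugation list is hard-coded here rather than produced by NodeBox Linguistics, and toy_count_errors is a hypothetical stand-in for whatever shallow grammar check ends up being used; it only knows about this one example and is not a real agreement checker.

```python
import random


def toy_count_errors(sentence):
    """Hypothetical stand-in checker: flags a bare verb after a singular-looking subject."""
    words = sentence.lower().split()
    if len(words) < 2:
        return 0
    subject, verb = words[-2], words[-1]
    singular_subject = not subject.endswith("s")  # crude guess, assumption only
    bare_verb = verb in {"bear"}                  # base forms known to this toy
    return 1 if singular_subject and bare_verb else 0


def pick_most_fluent(existing_text, possibilities, count_errors, rng=random):
    """Score each continuation, then choose randomly among the lowest-scoring ones."""
    scored = [(count_errors(existing_text + " " + p), p) for p in possibilities]
    lowest = min(score for score, _ in scored)
    best = [p for score, p in scored if score == lowest]
    return rng.choice(best)


existing_text = "The old river"
possibilities = ["bear", "bears", "bearing", "bore", "borne"]
choice = pick_most_fluent(existing_text, possibilities, toy_count_errors)
```

Passing the checker in as count_errors keeps the selection step independent of the grammar check, so the toy function can be swapped for a real one without touching pick_most_fluent. Note that an empty list of possibilities would raise a ValueError from min().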
Comments (3)
Very cool project, first of all.
I found a Java grammar checker. I've never used it, but the docs claim it can run as a server. Both Java and listening on a port should be supported basically anywhere.
I'm just getting into NLP with a CS background so I wouldn't mind going into more detail to help you integrate whatever you decide on using. Feel free to ask for more detail.
Another approach would be to use what is called an overgenerate-and-rank approach. In the first step, you have your poetry generator produce multiple candidates. Then you use a service like Amazon's Mechanical Turk to collect human judgments of fluency. I would actually suggest collecting simultaneous judgments for a number of sentences generated from the same seed conditions. Lastly, you extract features from the generated sentences (presumably using some form of syntactic parser) to train a model to rate or classify the quality of the generated sentences. You could even throw in the heuristics listed above.
Michael Heilman uses this approach for question generation. For more details, read these papers:
"Good Question! Statistical Ranking for Question Generation" and
"Rating Computer-Generated Questions with Mechanical Turk".
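As a rough illustration of the overgenerate-and-rank idea, the sketch below averages several human ratings per sentence, fits a tiny linear model on hand-made features, and ranks new candidates by predicted rating. Everything in it is an assumption for illustration: the single "adjacent repeated word" feature stands in for parser-derived features, the ratings are made-up placeholders rather than real Mechanical Turk data, and the plain gradient-descent trainer is not the method used in the papers cited above.

```python
from statistics import mean


def extract_features(sentence):
    """Toy features: a bias term and the number of adjacent repeated words."""
    words = sentence.lower().split()
    repeats = sum(1 for a, b in zip(words, words[1:]) if a == b)
    return [1.0, float(repeats)]


def predict(sentence, weights):
    """Predicted fluency rating: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, extract_features(sentence)))


def train_ranker(judgments, epochs=500, learning_rate=0.05):
    """Fit linear weights to the mean human rating of each sentence (squared error, SGD)."""
    examples = [(s, mean(ratings)) for s, ratings in judgments.items()]
    weights = [0.0] * len(extract_features(examples[0][0]))
    for _ in range(epochs):
        for sentence, target in examples:
            error = target - predict(sentence, weights)
            features = extract_features(sentence)
            weights = [w + learning_rate * error * f for w, f in zip(weights, features)]
    return weights


def rank(candidates, weights):
    """Sort candidates from highest to lowest predicted rating."""
    return sorted(candidates, key=lambda s: predict(s, weights), reverse=True)


# Placeholder ratings on a 1-5 scale, invented purely for this example.
judgments = {
    "The old river bears silt": [5, 4],
    "The old old river bore": [3, 2],
    "The the old river bear bear": [1, 2],
}
weights = train_ranker(judgments)
ranked = rank(["The river river bears", "The old river bears"], weights)
```

With these placeholder ratings the learned weight on repeated words comes out negative, so the candidate without a repetition is ranked first. A real setup would need far more judgments and richer features before the ranking meant anything.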
The pylinkgrammar link provided above is a bit out of date. It points to version 0.1.9, and the code samples for that version no longer work. If you go down this path, be sure to use the latest version, which can be found at:
https://pypi.python.org/pypi/pylinkgrammar