Part of Speech tagger
Identifying and extracting all the verbs in a text is straightforward with a Part-of-Speech (POS) tagger. Such taggers label every word in a text with a part-of-speech tag that indicates whether it is a verb, noun, adjective, adverb, etc. Modern POS taggers are very accurate. For example, Toutanova et al. (2003) report that Stanford's open source POS tagger assigns the correct tag 97.24% of the time on newswire data.
Performing POS tagging
Java: If you're using Java, a good package for POS tagging is the Stanford Log-linear Part-Of-Speech Tagger. Matthew Jockers put together a great tutorial on using this tagger.
Python: If you prefer Python, you can use the POS tagger included in the Natural Language Toolkit (nltk). A code snippet demonstrating how to perform POS tagging with this package is given below:
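import nltk

# Requires the NLTK tokenizer and tagger models; fetch them once with nltk.download()
text = "I am very happy to be here today"
tokens = nltk.word_tokenize(text)  # split the sentence into word tokens
pos_tagged_tokens = nltk.pos_tag(tokens)  # list of (word, tag) tuples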
The resulting POS tagged tokens will be a list of tuples, where the first entry in each tuple is the tagged word and the second entry is the word's POS tag. For the code snippet above, pos_tagged_tokens will hold one such (word, tag) pair for each token in the sentence.
Understanding the Tag Set
Both the Stanford POS tagger and NLTK use the Penn Treebank tag set. If you're just interested in extracting the verbs, pull out all words whose POS tag starts with a "V" (e.g., VB, VBD, VBG, VBN, VBP, and VBZ).
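To make the "starts with a V" rule concrete, here is a minimal sketch of the filtering step. The function name extract_verbs and the example list are hypothetical, and the tags in the list are written by hand for illustration rather than captured from a tagger run, so the example works without NLTK installed.

```python
def extract_verbs(tagged_tokens):
    """Keep the words whose Penn Treebank tag starts with "V"."""
    return [word for word, tag in tagged_tokens if tag.startswith("V")]


# Hand-written (word, tag) pairs for illustration only; this is an assumption,
# not the recorded output of nltk.pos_tag.
example_tagged = [
    ("I", "PRP"), ("am", "VBP"), ("very", "RB"), ("happy", "JJ"),
    ("to", "TO"), ("be", "VB"), ("here", "RB"), ("today", "NN"),
]

print(extract_verbs(example_tagged))  # prints ['am', 'be']
```

In practice you would pass the list returned by nltk.pos_tag instead of the hand-written one. Note that the Penn Treebank tags modal verbs such as "can" or "will" as MD, so this filter does not pick them up; add that tag to the check if you want them.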
Parsing natural language with regex is impossible. Forget it.
As a drastic example: How would you find the verbs (marked with asterisks) in this sentence?
While you'll hardly come across extreme cases like this, there are dozens of verbs that could also be nouns, adjectives, etc. if you look at the word alone.
You need a natural language parser like Stanford NLP. I have never used one, so I don't know how good your results are going to be, but they will be better than with regex, I can tell you that.
Although it is a year later, I found a very useful tool from Northwestern University called MorphAdorner.
It handles all kinds of tasks, e.g. lemmatization, language recognition, name recognition, parsing, sentence splitting, etc.
It is convenient and easy to use.
This is actually a very hard task in NLP (Natural Language Processing). Regular expressions on their own won't be enough. Take, for example, the word "training": it can be used as either a verb or a noun ("I'm going to the training session"). Obviously, a regular expression won't be able to tell the difference between the two. There are other problems as well: "-ed" is a common ending for past tense verbs, but it will fail you in the case of "disgusted".
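To see the suffix problem in action, here is a small sketch of the kind of naive regex heuristic being warned against. The pattern and the function name guess_verbs_by_suffix are made up for illustration; this is not a recommended approach, only a demonstration of how it misfires.

```python
import re

# Naive heuristic: treat any word ending in "ed" or "ing" as a verb.
SUFFIX_PATTERN = re.compile(r"\b\w+(?:ed|ing)\b", re.IGNORECASE)


def guess_verbs_by_suffix(sentence):
    """Return every word the naive suffix pattern would call a verb."""
    return SUFFIX_PATTERN.findall(sentence)


# "training" is a noun here, but the pattern still reports it.
print(guess_verbs_by_suffix("I'm going to the training session"))  # prints ['going', 'training']

# "disgusted" is used as an adjective here, but it is reported too.
print(guess_verbs_by_suffix("She was disgusted"))  # prints ['disgusted']

# Irregular verbs such as "ran" are missed entirely.
print(guess_verbs_by_suffix("They ran home"))  # prints []
```

The pattern only sees the spelling of each word, never its role in the sentence, which is why it produces both false positives and misses.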
There are some techniques that can give you a good (not perfect, but good) indication of whether a given word is a verb, but they can also be quite expensive computationally.
So the first question you should ask yourself (in my opinion) is what quality of answer you need versus how much processing time you can afford.