If all you want is adjective frequencies, then the problem is relatively simple, and you don't need some brutal, not-so-good machine learning solution.
Wat do?
Do POS tagging on your text. This annotates your text with part of speech tags, so you'll have 95% accuracy or more on that. You can tag your text using the Stanford Parser online to get a feel for it. The parser actually also gives you the grammatical structure, but you only care about the tagging.
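In Python/NLTK (more on tools below) that's only a couple of lines. A minimal sketch, assuming a standard NLTK install; punkt and averaged_perceptron_tagger are the classic resource names, though very recent NLTK releases ship variants with slightly different ids:

    import nltk

    # One-time model downloads: tokenizer + POS tagger.
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
    print(nltk.pos_tag(tokens))
    # [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]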
You also want to make sure the sentences are broken up properly. For this you need a sentence breaker. That's included with software like the Stanford parser.
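If you're wondering what a sentence breaker buys you over naively splitting on periods: abbreviations, mostly. A minimal sketch with NLTK's pretrained Punkt tokenizer, assuming the downloads above (the example text is my own):

    import nltk

    text = "Mr. Smith bought a cheap car. It broke down immediately."
    print(nltk.sent_tokenize(text))
    # Expected: ['Mr. Smith bought a cheap car.', 'It broke down immediately.']
    # A naive split on '.' would wrongly break after 'Mr.'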
Then just break up the sentences, tag them, and count all things with the tag ADJ or whatever tag the tool uses. If the tags don't make sense, look up the Penn Treebank tagset, where adjectives are JJ (plain), JJR (comparative) and JJS (superlative). (Treebanks are used to train NLP tools, and the Penn Treebank tags are the common ones.)
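Putting the whole pipeline together, again as a sketch on top of NLTK (the example text and the startswith("JJ") filter over the Penn Treebank adjective tags are my own choices):

    from collections import Counter
    import nltk

    text = "The quick brown fox jumps over the lazy dog. What a lazy dog!"

    adjective_counts = Counter()
    for sentence in nltk.sent_tokenize(text):                # sentence breaking
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # POS tagging
        # JJ/JJR/JJS are the Penn Treebank adjective tags.
        adjective_counts.update(
            word.lower() for word, tag in tagged if tag.startswith("JJ")
        )

    print(adjective_counts.most_common())
    # Expected: [('lazy', 2), ('quick', 1), ('brown', 1)]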
How?
Java and Python are the languages of NLP tools. For Python, use NLTK: it's easy, well documented and well understood.
For Java, you have GATE, LingPipe and the Stanford Parser, among others. The Stanford Parser is a complete pain in the ass to use; fortunately, I've already suffered so you don't have to if you choose to go that route. See my Google page for some code examples with the Stanford Parser (at the bottom of the page).
Das all?
Nah, you might want to stem the adjectives too; that's how you get the root form of a word:
cars -> car
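In NLTK the usual way is the WordNet lemmatizer rather than a crude suffix-stripping stemmer. A minimal sketch, assuming the standard wordnet resource; the 'better' -> 'good' case shows one adjective situation where it does matter:

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet")  # dictionary the lemmatizer looks words up in

    wnl = WordNetLemmatizer()
    print(wnl.lemmatize("cars"))             # 'car'  (default POS is noun)
    print(wnl.lemmatize("better", pos="a"))  # 'good' (pos="a" means adjective)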
Offhand, this is rarely necessary with adjectives (comparatives like the one above are the main case), but when you look at your output it'll be apparent whether you need to do this. A POS tagger/parser/etc. will get you your stemmed words (also called lemmas).
More NLP Explanations
See this question.
It depends on the source of your data. If the sentences come from some kind of generator, you can probably split them automatically. Otherwise you will need NLP, yes.
Properly parsing natural language is pretty much an open problem. It works "largely" for English, in particular because English sentences tend to stick to SVO order. German, for example, is quite nasty here, as different word orders convey different emphasis (and can thus convey different meanings, in particular when irony is used). Additionally, German tends to use subordinate clauses much more.
NLP clearly is the way to go; at least some basic parser will be needed. It really depends on your task, too: do you need every single one to be correct, or is a probabilistic approach good enough? Can "difficult" cases be discarded or fed to a human for review? Etc.