Java Stanford NLP: Part of Speech labels?
The Stanford NLP, demo'd here, gives an output like this:
Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./.
What do the Part of Speech tags mean? I am unable to find an official list. Is it Stanford's own system, or are they using universal tags? (What is JJ, for instance?)
Also, when I am iterating through the sentences, looking for nouns, for instance, I end up doing something like checking to see if the tag .contains('N'). This feels pretty weak. Is there a better way to programmatically search for a certain part of speech?
The tags come from the Penn Treebank Project. Look at the part-of-speech tagging guide (a PostScript file).
JJ is adjective. NNS is noun, plural. VBP is verb, non-3rd person singular present. RB is adverb.
That's for English. For Chinese, it's the Penn Chinese Treebank. And for German, it's the NEGRA corpus.
Explanation of each tag from the documentation:
The accepted answer above is missing the following information:
There are also 9 punctuation tags defined (which are not listed in some references, see here). These are:
Here is a more complete list of tags for the Penn Treebank (posted here for the sake of completeness):
http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html
It also includes tags for clause and phrase levels.
Clause Level
Phrase Level
(descriptions in the link)
Codified:
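One way to codify the tag set in Java is a plain lookup table. The following is only a minimal sketch, not the answerer's own listing: the class name `PennTreebankTags` and the method `describe` are made up for illustration, and the table covers just the 36 word-level Penn Treebank tags (punctuation tags are left out).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative lookup table for the 36 word-level Penn Treebank POS tags. */
final class PennTreebankTags {
    private static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();

    static {
        DESCRIPTIONS.put("CC", "coordinating conjunction");
        DESCRIPTIONS.put("CD", "cardinal number");
        DESCRIPTIONS.put("DT", "determiner");
        DESCRIPTIONS.put("EX", "existential there");
        DESCRIPTIONS.put("FW", "foreign word");
        DESCRIPTIONS.put("IN", "preposition or subordinating conjunction");
        DESCRIPTIONS.put("JJ", "adjective");
        DESCRIPTIONS.put("JJR", "adjective, comparative");
        DESCRIPTIONS.put("JJS", "adjective, superlative");
        DESCRIPTIONS.put("LS", "list item marker");
        DESCRIPTIONS.put("MD", "modal");
        DESCRIPTIONS.put("NN", "noun, singular or mass");
        DESCRIPTIONS.put("NNS", "noun, plural");
        DESCRIPTIONS.put("NNP", "proper noun, singular");
        DESCRIPTIONS.put("NNPS", "proper noun, plural");
        DESCRIPTIONS.put("PDT", "predeterminer");
        DESCRIPTIONS.put("POS", "possessive ending");
        DESCRIPTIONS.put("PRP", "personal pronoun");
        DESCRIPTIONS.put("PRP$", "possessive pronoun");
        DESCRIPTIONS.put("RB", "adverb");
        DESCRIPTIONS.put("RBR", "adverb, comparative");
        DESCRIPTIONS.put("RBS", "adverb, superlative");
        DESCRIPTIONS.put("RP", "particle");
        DESCRIPTIONS.put("SYM", "symbol");
        DESCRIPTIONS.put("TO", "to");
        DESCRIPTIONS.put("UH", "interjection");
        DESCRIPTIONS.put("VB", "verb, base form");
        DESCRIPTIONS.put("VBD", "verb, past tense");
        DESCRIPTIONS.put("VBG", "verb, gerund or present participle");
        DESCRIPTIONS.put("VBN", "verb, past participle");
        DESCRIPTIONS.put("VBP", "verb, non-3rd person singular present");
        DESCRIPTIONS.put("VBZ", "verb, 3rd person singular present");
        DESCRIPTIONS.put("WDT", "wh-determiner");
        DESCRIPTIONS.put("WP", "wh-pronoun");
        DESCRIPTIONS.put("WP$", "possessive wh-pronoun");
        DESCRIPTIONS.put("WRB", "wh-adverb");
    }

    /** Returns the description for a tag, or "unknown tag" if it is not in the table. */
    static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "unknown tag");
    }

    /** Number of tags in the table. */
    static int size() {
        return DESCRIPTIONS.size();
    }

    public static void main(String[] args) {
        // Describe the tags that appear in the question's example sentence.
        for (String tag : new String[] {"JJ", "NNS", "VBP", "RB"}) {
            System.out.println(tag + " = " + describe(tag));
        }
    }
}
```

A map keyed by the tag string is used instead of an enum here because tags such as `PRP$` read awkwardly as Java identifiers, and a string lookup works directly on whatever the tagger returns.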
I am providing the whole list here, along with a reference link.
You can find the whole list of part-of-speech tags here.
Regarding your second question, about finding words or chunks tagged with a particular POS (e.g., noun), here is sample code you can follow.
The output is:
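As a rough sketch of this kind of lookup (not the answerer's own code), the example below works on the plain `word/TAG` string shown in the question and matches tags against an explicit set instead of using `.contains('N')`, which would also match unrelated tags such as `IN` or `VBN`. The class name `PosTagFilter` and the method `wordsWithTags` are hypothetical, and it assumes the tagger output is available as a single whitespace-separated string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative filter over "word/TAG" tagger output. */
final class PosTagFilter {
    /** The four Penn Treebank noun tags. */
    static final Set<String> NOUN_TAGS =
            new HashSet<>(Arrays.asList("NN", "NNS", "NNP", "NNPS"));

    /** Returns the words whose tag is exactly one of the wanted tags, in order. */
    static List<String> wordsWithTags(String tagged, Set<String> wanted) {
        List<String> words = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Split on the LAST slash so words that contain '/' stay intact.
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // no tag separator (or empty word): skip the token
            }
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            if (wanted.contains(tag)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String tagged = "Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./.";
        System.out.println(wordsWithTags(tagged, NOUN_TAGS)); // prints [ideas]
    }
}
```

For the noun case specifically, `tag.startsWith("NN")` would give the same result; an explicit set generalizes more easily to groups of tags that do not share a prefix.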
They seem to be Brown Corpus tags.
Stanford CoreNLP Tags for Other Languages: French, Spanish, German ...
I see you are using the parser for English, which is the default model.
You may use the parser for other languages (French, Spanish, German ...), but be aware that both the tokenizers and the part-of-speech taggers are different for each language. If you want to do that, you must download the specific model for the language (using a build tool like Maven, for example) and then set the model you want to use.
Here you have more information about that.
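Setting the model is typically done through a `java.util.Properties` object handed to the pipeline. The sketch below only builds such a configuration and does not load CoreNLP itself; the class name `LanguagePipelineConfig` and the model path are placeholders, and the real path depends on which language model jar you downloaded.

```java
import java.util.Properties;

/** Illustrative builder for a CoreNLP-style pipeline configuration. */
final class LanguagePipelineConfig {
    /**
     * Builds properties that select the annotators to run and the POS model to load.
     * "annotators" and "pos.model" are CoreNLP property names; the values are caller-supplied.
     */
    static Properties forModel(String annotators, String posModelPath) {
        Properties props = new Properties();
        props.setProperty("annotators", annotators);
        props.setProperty("pos.model", posModelPath);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder path: replace with the tagger file shipped in your language model jar.
        Properties props = forModel("tokenize,ssplit,pos", "path/to/french.tagger");
        System.out.println(props.getProperty("annotators"));
        System.out.println(props.getProperty("pos.model"));
        // With CoreNLP on the classpath, these properties would then be passed to
        // new StanfordCoreNLP(props) to create the pipeline.
    }
}
```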
Here are lists of tags for different languages:
TAGS FOR FRENCH:
Part of Speech Tags for French
Phrasal Categories Tags for French:
Syntactic Functions for French:
In spaCy it was very fast, I think. Even on a low-end notebook it runs like this:
The output over several trials:
So I think you don't need to worry about looping over each POS tag check :)
I got a further improvement when I disabled certain pipeline components:
So the result is faster: