Named Entity Recognition with NLTK in Python: Identifying the NE
I need to classify words into their parts of speech, such as verb, noun, adverb and so on.
I used:
nltk.word_tokenize()  # to identify the words in a sentence
nltk.pos_tag()        # to identify the parts of speech
nltk.ne_chunk()       # to identify named entities
The output of this is a tree.
For example:
>>> sentence = "I am Jhon from America"
>>> sent1 = nltk.word_tokenize(sentence )
>>> sent2 = nltk.pos_tag(sent1)
>>> sent3 = nltk.ne_chunk(sent2, binary=True)
>>> sent3
Tree('S', [('I', 'PRP'), ('am', 'VBP'), Tree('NE', [('Jhon', 'NNP')]), ('from', 'IN'), Tree('NE', [('America', 'NNP')])])
When accessing elements of this tree, I did it as follows:
>>> sent3[0]
('I', 'PRP')
>>> sent3[0][0]
'I'
>>> sent3[0][1]
'PRP'
But when accessing a Named Entity:
>>> sent3[2]
Tree('NE', [('Jhon', 'NNP')])
>>> sent3[2][0]
('Jhon', 'NNP')
>>> sent3[2][1]
Traceback (most recent call last):
  File "<pyshell#121>", line 1, in <module>
    sent3[2][1]
  File "C:\Python26\lib\site-packages\nltk\tree.py", line 139, in __getitem__
    return list.__getitem__(self, index)
IndexError: list index out of range
I got the above error.
What I want is to get 'NE' as the output, similar to 'PRP' above. As it stands, I can't identify which word is a named entity.
Is there any way of doing this with NLTK in Python? If so, please post the command. Or is there a function in the tree library that does this? I need the node value 'NE'.
Comments (6)
This answer may be off base, in which case I'll delete it, since I don't have NLTK installed here to try it, but I think you can just do:
sent3[2][0] returns the first child of the tree, not the node itself.
Edit: I tried this when I got home, and it does indeed work.
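To make the difference between a subtree's children and its label concrete, here is a minimal sketch. It assumes NLTK 3 or later, where the label is read with label(), and it builds the tree from the question by hand instead of calling nltk.ne_chunk(), so no tagger or chunker models are needed.

```python
from nltk.tree import Tree

# Hand-built copy of the tree shown in the question (an assumption made so
# the example runs without downloading any NLTK models).
sent3 = Tree('S', [
    ('I', 'PRP'), ('am', 'VBP'),
    Tree('NE', [('Jhon', 'NNP')]),
    ('from', 'IN'),
    Tree('NE', [('America', 'NNP')]),
])

ne = sent3[2]
child_count = len(ne)   # the subtree has a single child, so ne[1] raises IndexError
first_child = ne[0]     # a (word, tag) tuple, not the node label
ne_label = ne.label()   # the node label itself

print(child_count, first_child, ne_label)
```

Indexing a Tree behaves like indexing a list of its children, which is why the label has to be read through a separate accessor.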
Below is my code:
This will work
I agree with bdk:
sent3[2].node
Output: 'NE'
I don't think there is a function in NLTK that does this. The solution above will work, but for reference you can check here.
For the looping problem, you can do:
I have executed this in NLTK and it works fine.
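One way the loop could look is sketched below. This is not the code from this answer; the helper name tag_entities is hypothetical, the tree is rebuilt by hand from the question, and label() assumes NLTK 3 or later.

```python
from nltk.tree import Tree

def tag_entities(chunked):
    """Return (word, pos_tag, entity_label) triples; the label is None outside chunks."""
    rows = []
    for child in chunked:
        if isinstance(child, Tree):
            # A chunk: every leaf under it shares the chunk's label.
            for word, pos in child.leaves():
                rows.append((word, pos, child.label()))
        else:
            # A plain (word, tag) tuple that is not inside any chunk.
            word, pos = child
            rows.append((word, pos, None))
    return rows

# Hand-built copy of the tree from the question (no models required).
sent3 = Tree('S', [
    ('I', 'PRP'), ('am', 'VBP'),
    Tree('NE', [('Jhon', 'NNP')]),
    ('from', 'IN'),
    Tree('NE', [('America', 'NNP')]),
])

rows = tag_entities(sent3)
for row in rows:
    print(row)
```

The isinstance check is what separates named-entity chunks from ordinary tagged words, so the loop never indexes past the end of a one-child subtree.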
sent3[2].node is now outdated.
Use sent3[2].label() instead.
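If the same script has to run against both old and new NLTK releases, a small compatibility helper is one option. The sketch below is only an illustration: node_label is a hypothetical name, and the fallback branch assumes the older attribute-style API described in the earlier answer.

```python
from nltk.tree import Tree

def node_label(subtree):
    """Read a subtree's label on either API generation (hypothetical helper)."""
    if hasattr(subtree, 'label'):
        return subtree.label()   # NLTK 3 and later
    return subtree.node          # older releases, as used in the earlier answer

label = node_label(Tree('NE', [('Jhon', 'NNP')]))
print(label)
```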
You can treat the sentence as a tree and loop through it.
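Two ways of walking the tree are sketched below, assuming NLTK 3 or later and a hand-built copy of the question's tree. The first filters subtrees by label. The second uses nltk.chunk.tree2conlltags to flatten the tree into (word, tag, IOB label) triples. With binary=True every chunk is labelled 'NE'; without it the labels would be entity types such as PERSON or GPE, and the filter would need adjusting.

```python
from nltk.chunk import tree2conlltags
from nltk.tree import Tree

# Hand-built copy of the tree from the question (no models required).
sent3 = Tree('S', [
    ('I', 'PRP'), ('am', 'VBP'),
    Tree('NE', [('Jhon', 'NNP')]),
    ('from', 'IN'),
    Tree('NE', [('America', 'NNP')]),
])

# Collect the words under every subtree labelled 'NE'.
entities = [
    ' '.join(word for word, pos in subtree.leaves())
    for subtree in sent3.subtrees(filter=lambda t: t.label() == 'NE')
]

# Flatten to (word, pos_tag, iob_tag) triples; words outside chunks get 'O'.
iob = tree2conlltags(sent3)

print(entities)
print(iob)
```

The flattened form gives every word a third field, which is close to what the question asks for: a per-word marker comparable to the part-of-speech tag.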