Named entity recognition with NLTK in Python: identifying the NE

Posted on 2024-11-02 17:10:12

I need to classify words into their parts of speech, such as verb, noun, or adverb.
I used the following functions:

nltk.word_tokenize()  # to split a sentence into words
nltk.pos_tag()        # to tag the parts of speech
nltk.ne_chunk()       # to identify named entities

The output of this is a tree. For example:

>>> import nltk
>>> sentence = "I am Jhon from America"
>>> sent1 = nltk.word_tokenize(sentence)
>>> sent2 = nltk.pos_tag(sent1)
>>> sent3 = nltk.ne_chunk(sent2, binary=True)
>>> sent3
Tree('S', [('I', 'PRP'), ('am', 'VBP'), Tree('NE', [('Jhon', 'NNP')]), ('from', 'IN'), Tree('NE', [('America', 'NNP')])])

When accessing an element of this tree, I did it as follows:

>>> sent3[0]
('I', 'PRP')
>>> sent3[0][0]
'I'
>>> sent3[0][1]
'PRP'

But when accessing a Named Entity:

>>> sent3[2]
Tree('NE', [('Jhon', 'NNP')])
>>> sent3[2][0]
('Jhon', 'NNP')
>>> sent3[2][1]    
Traceback (most recent call last):
  File "<pyshell#121>", line 1, in <module>
    sent3[2][1]
  File "C:\Python26\lib\site-packages\nltk\tree.py", line 139, in __getitem__
    return list.__getitem__(self, index)
IndexError: list index out of range

I got the above error.

What I want is to get 'NE' as the output, similar to the 'PRP' above, so that I can identify which words are named entities.
Is there any way of doing this with NLTK in Python? If so, please post the command. Or is there a function in the tree library that does this? I need the node value 'NE'.

千仐 2024-11-09 17:10:12

This answer may be off base, in which case I'll delete it, since I don't have NLTK installed here to try it, but I think you can just do:

   >>> sent3[2].node
   'NE'

sent3[2][0] returns the first child of the subtree, not the node label itself.

Edit: I tried this when I got home, and it does indeed work.
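
To make the distinction between a node's label and its children concrete, here is a small sketch that rebuilds the tree from the question by hand, so no tokenizer or chunker models are needed. It assumes NLTK 3 or later, where the label is read with label(); on the older NLTK used in the question the equivalent would be the .node attribute shown above. The variable name ne is only illustrative.

```python
from nltk.tree import Tree

# Hand-built copy of the tree printed in the question (no models required).
sent3 = Tree('S', [
    ('I', 'PRP'),
    ('am', 'VBP'),
    Tree('NE', [('Jhon', 'NNP')]),
    ('from', 'IN'),
    Tree('NE', [('America', 'NNP')]),
])

ne = sent3[2]        # the subtree for the first named entity
print(ne.label())    # NE  (the node label of the subtree)
print(ne[0])         # ('Jhon', 'NNP')  (its only child, a (word, tag) tuple)
print(len(ne))       # 1  (so index 1 is out of range)
```

Indexing a tree always walks into its children, which is why sent3[2][1] raised IndexError: the 'NE' subtree has a single child, and the label lives on the node rather than in the child list.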

东风软 2024-11-09 17:10:12

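Below is my code:

from nltk import ne_chunk

myNE = []
chunks = ne_chunk(postags, binary=True)  # postags: the output of nltk.pos_tag()
for c in chunks:
  # Chunk subtrees expose label() in NLTK 3 (it was the .node attribute in NLTK 2).
  if hasattr(c, 'label'):
    myNE.append(' '.join(i[0] for i in c.leaves()))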

故事还在继续 2024-11-09 17:10:12

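This will work:

# chunked_sentences: an iterable of trees returned by nltk.ne_chunk()
for sent in chunked_sentences:
    for chunk in sent:
        # Only chunk subtrees have label(); plain tokens are (word, tag) tuples.
        if hasattr(chunk, "label"):
            print(chunk.label())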

躲猫猫 2024-11-09 17:10:12

I agree with bdk:

sent3[2].node

Output: 'NE'

I don't think NLTK has a dedicated function for this. The solution above will work, but for reference you can check here

To handle this in a loop, you can do:

 for i in range(len(sent3)):
     # Only subtrees have a label; plain tokens are (word, tag) tuples.
     if hasattr(sent3[i], "label"):
          print(sent3[i].label())  # sent3[i].node on NLTK 2

I have executed this in NLTK and it works fine.
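
As a possible counterpoint, NLTK does appear to ship a helper that comes close to what the question asks for: nltk.chunk.tree2conlltags flattens a chunk tree into (word, POS tag, IOB tag) triples, where the IOB tag is 'O' for ordinary tokens and 'B-' or 'I-' plus the chunk label for tokens inside a chunk. The sketch below assumes NLTK 3 or later and uses a hand-built tree in place of real ne_chunk output, so it runs without any downloaded models.

```python
from nltk.chunk import tree2conlltags
from nltk.tree import Tree

# Hand-built copy of the tree from the question.
sent3 = Tree('S', [
    ('I', 'PRP'),
    ('am', 'VBP'),
    Tree('NE', [('Jhon', 'NNP')]),
    ('from', 'IN'),
    Tree('NE', [('America', 'NNP')]),
])

# Each token becomes a (word, pos_tag, iob_tag) triple.
tags = tree2conlltags(sent3)
for word, pos, iob in tags:
    print(word, pos, iob)
```

With this form every token can be indexed the same way, and a token belongs to a named entity whenever its third field is not 'O'.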

泪是无色的血 2024-11-09 17:10:12

sent3[2].node is now outdated.

Use sent3[2].label() instead.
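
If the same script has to run against both old and new NLTK installations, a small wrapper can hide the difference. This is only a sketch: tree_label is a hypothetical helper name, not part of NLTK, and it assumes that NLTK 2 trees expose the label only through the .node attribute while NLTK 3 trees provide a label() method. Only the NLTK 3 path is exercised here.

```python
from nltk.tree import Tree

def tree_label(tree):
    """Return the node label of an NLTK tree on either API generation."""
    label = getattr(tree, "label", None)
    if callable(label):
        return label()   # NLTK 3 and later
    return tree.node     # assumed fallback for NLTK 2

ne = Tree('NE', [('Jhon', 'NNP')])
print(tree_label(ne))    # NE
```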

一身骄傲 2024-11-09 17:10:12

You can treat the sentence as a tree and loop through it.

import nltk

entities = nltk.ne_chunk(text)  # text: POS-tagged tokens from nltk.pos_tag()
for elem in entities:
    if isinstance(elem, nltk.Tree):
        # Is an entity
        print('elem: ', elem.leaves(), elem.label())
    else:
        # Not an entity: a plain (word, tag) tuple
        pass
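
The same loop can be extended to group entities by their label, which is mostly useful when ne_chunk is called without binary=True and returns typed labels. The sketch below is illustrative only: entities_by_label is a hypothetical helper name, the tree is hand-built so that no models are needed, and the labels PERSON and GPE are assumed to match what the chunker would emit for this input. NLTK 3 or later is assumed.

```python
from collections import defaultdict

from nltk.tree import Tree

def entities_by_label(tree):
    """Map each chunk label to the list of entity strings found under it."""
    grouped = defaultdict(list)
    for elem in tree:
        if isinstance(elem, Tree):
            # Join the words of a (possibly multi-word) entity.
            words = [word for word, tag in elem.leaves()]
            grouped[elem.label()].append(' '.join(words))
    return dict(grouped)

# Hand-built stand-in for ne_chunk output (labels are assumptions).
tree = Tree('S', [
    Tree('PERSON', [('John', 'NNP'), ('Smith', 'NNP')]),
    ('lives', 'VBZ'),
    ('in', 'IN'),
    Tree('GPE', [('America', 'NNP')]),
])

result = entities_by_label(tree)
print(result)  # {'PERSON': ['John Smith'], 'GPE': ['America']}
```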
