
Breaking down / decomposing complex and compound sentences in nltk

Posted 2024-09-14 20:07:19

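Is there a way to decompose complex sentences into simple sentences in nltk or another natural language processing library?

For example:

The park is so wonderful when the sun is setting and a cool breeze is blowing ==> The sun is setting. A cool breeze is blowing. The park is so wonderful.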


半步萧音过轻尘 2024-09-21 20:07:19

This is much more complicated than it seems, so you're unlikely to find a perfectly clean method.

However, using the English parser in OpenNLP, I can take your example sentence and get the following parse tree:

  (S
    (NP (DT The) (NN park))
    (VP
      (VBZ is)
      (ADJP (RB so) (JJ wonderful))
      (SBAR
        (WHADVP (WRB when))
        (S
          (S (NP (DT the) (NN sun)) (VP (VBZ is) (VP (VBG setting))))
          (CC and)
          (S
            (NP (DT a) (JJ cool) (NN breeze))
            (VP (VBZ is) (VP (VBG blowing)))))))
    (. .))

From there, you can pick it apart as you like. You can get the main clause by extracting the top-level (NP *)(VP *) minus the (SBAR *) section. Then you can split the conjunction inside the (SBAR *) into the other two statements.
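To make that procedure concrete, here is a minimal sketch that only handles the shape of this one example (a main clause with a single SBAR containing coordinated S clauses). To stay dependency-free it uses a small hand-written reader for the bracketed format; `read_tree`, `words`, `find`, and `split_sentence` are hypothetical helper names, not part of nltk or OpenNLP.

```python
import re

def read_tree(text):
    """Parse a bracketed parse string into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                         # skip "("
        tree = [tokens[pos]]             # the label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                tree.append(node())      # nested constituent
            else:
                tree.append(tokens[pos]) # a word
                pos += 1
        pos += 1                         # skip ")"
        return tree

    return node()

def words(tree, skip=()):
    """Collect leaf words, ignoring subtrees whose label is in `skip`."""
    out = []
    for child in tree[1:]:
        if isinstance(child, str):
            out.append(child)
        elif child[0] not in skip:
            out.extend(words(child, skip))
    return out

def find(tree, label):
    """Return the first subtree with the given label (depth-first), or None."""
    if tree[0] == label:
        return tree
    for child in tree[1:]:
        if not isinstance(child, str):
            hit = find(child, label)
            if hit is not None:
                return hit
    return None

def to_sentence(ws):
    text = " ".join(ws)
    return text[0].upper() + text[1:] + "."

def split_sentence(tree):
    # Main clause: everything at the top level minus the SBAR and the final "."
    main = words(tree, skip=("SBAR", "."))
    clauses = []
    sbar = find(tree, "SBAR")
    if sbar is not None:
        # The S directly under SBAR holds the subordinate clause(s)
        inner = next((c for c in sbar[1:]
                      if not isinstance(c, str) and c[0] == "S"), None)
        if inner is not None:
            parts = [c for c in inner[1:]
                     if not isinstance(c, str) and c[0] == "S"]
            # Without coordination, the inner S is itself the only clause
            clauses = [words(p) for p in (parts or [inner])]
    return [to_sentence(w) for w in clauses + [main]]

PARSE = """
(S (NP (DT The) (NN park))
   (VP (VBZ is) (ADJP (RB so) (JJ wonderful))
       (SBAR (WHADVP (WRB when))
             (S (S (NP (DT the) (NN sun)) (VP (VBZ is) (VP (VBG setting))))
                (CC and)
                (S (NP (DT a) (JJ cool) (NN breeze)) (VP (VBZ is) (VP (VBG blowing)))))))
   (. .))
"""

print(split_sentence(read_tree(PARSE)))
# ['The sun is setting.', 'A cool breeze is blowing.', 'The park is so wonderful.']
```

Real sentences will need more cases than this (several SBARs, relative clauses, clauses that share a subject), so treat it as a starting point rather than a general solution.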

Note that the OpenNLP parser is trained on the Penn Treebank corpus. I got a pretty accurate parse of your example sentence, but the parser isn't perfect and can be wildly wrong on other sentences. For an explanation of its tags, see the Penn Treebank tag set documentation, which assumes you already have some basic understanding of linguistics and English grammar.
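As a quick reference, the labels that appear in the tree above can be glossed as follows. This is only a small hand-written subset of the Penn Treebank tag set, and `TAGS` and `tags_in` are hypothetical names used for illustration.

```python
import re

# Glosses for the Penn Treebank labels used in the example parse (subset only)
TAGS = {
    "S": "simple declarative clause",
    "SBAR": "clause introduced by a subordinating conjunction",
    "NP": "noun phrase",
    "VP": "verb phrase",
    "ADJP": "adjective phrase",
    "WHADVP": "wh-adverb phrase",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "VBZ": "verb, 3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "RB": "adverb",
    "JJ": "adjective",
    "CC": "coordinating conjunction",
    "WRB": "wh-adverb",
}

def tags_in(parse):
    """Return the distinct labels in a bracketed parse, in order of first appearance."""
    seen = []
    for label in re.findall(r"\(([^\s()]+)", parse):
        if label not in seen:
            seen.append(label)
    return seen

for tag in tags_in("(S (NP (DT The) (NN park)) (VP (VBZ is) (ADJP (RB so) (JJ wonderful))))"):
    print(tag, "=", TAGS.get(tag, "not in this subset"))
```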

Edit: Btw, this is how I access OpenNLP from Python. This assumes you have the OpenNLP jar and model files in an opennlp-tools-1.4.3 folder.

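import os
from subprocess import Popen, PIPE
import nltk

# Base path: the directory that contains this script
BP = os.path.dirname(os.path.abspath(__file__))
# Java classpath; the ":" separator assumes Linux or macOS
CP = "%(BP)s/opennlp-tools-1.4.3.jar:%(BP)s/opennlp-tools-1.4.3/lib/maxent-2.5.2.jar:%(BP)s/opennlp-tools-1.4.3/lib/jwnl-1.3.3.jar:%(BP)s/opennlp-tools-1.4.3/lib/trove.jar" % dict(BP=BP)
cmd = "java -cp %(CP)s -Xmx1024m opennlp.tools.lang.english.TreebankParser -k 1 -d %(BP)s/opennlp.models/english/parser" % dict(CP=CP, BP=BP)
# universal_newlines=True opens the pipes in text mode, so str works on Python 3
p = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE, close_fds=True, universal_newlines=True)
stdin, stdout, stderr = (p.stdin, p.stdout, p.stderr)
text = "This is my sample sentence."
stdin.write('%s\n' % text)
stdin.flush()  # make sure the parser actually receives the line
ret = stdout.readline()
ret = ret.split(' ')
# The second field is read as the parse score; the rest is the bracketed tree
prob = float(ret[1])
# Tree.parse was renamed to Tree.fromstring in NLTK 3
tree = nltk.Tree.fromstring(' '.join(ret[2:]))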
