Extracting "useful" information from sentences?
I am currently trying to understand sentences of this form:
The problem was more with the set-top box than the television. Restarting the set-top box solved the problem.
I am totally new to Natural Language Processing and started using Python's NLTK package to get my hands dirty. However, I am wondering if someone could give me an overview of the high-level steps involved in achieving this.
What I am trying to do is identify what the problem was (in this case, "set-top box") and whether the action that was taken resolved the problem (in this case, "yes", because restarting fixed the problem). If all the sentences were of this form, my life would be easier, but because this is natural language, the sentences could also take the following form:
I took a look at the car and found nothing wrong with it. However, I suspect there is something wrong with the engine
So in this case, the problem was with the "car". The action taken did not resolve the problem, as signalled by the presence of the word "suspect". And the potential problem could be with the "engine".
I am not looking for an absolute answer, as I suspect this is very complex. What I am looking for is a high-level overview that will point me in the right direction. If there is an easier or alternative way to do this, that is welcome as well.
Comments (2)
Really, the best you could hope for is a Naive Bayes classifier with a sufficiently large training set (probably larger than the one you have), and you would have to be willing to tolerate a fair rate of false determinations.
Seeking the holy grail of NLP is bound to leave you somewhat unsatisfied.
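To make the classifier idea a bit more concrete, here is a minimal sketch using NLTK's NaiveBayesClassifier with bag-of-words features. The six training sentences, the labels "resolved" and "unresolved", and the helper word_features are made up for illustration; a usable training set would need to be far larger, and whitespace splitting stands in for a proper tokenizer.

```python
import nltk


def word_features(text):
    """Bag-of-words features: one boolean feature per lowercased token."""
    return {f"contains({word})": True for word in text.lower().split()}


# Tiny hand-labelled training set (hypothetical examples, for illustration only).
training_sentences = [
    ("restarting the set-top box solved the problem", "resolved"),
    ("replacing the cable fixed the issue", "resolved"),
    ("the reboot solved it", "resolved"),
    ("i suspect there is something wrong with the engine", "unresolved"),
    ("found nothing wrong but the problem remains", "unresolved"),
    ("still suspect the fan is faulty", "unresolved"),
]

train_set = [(word_features(text), label) for text, label in training_sentences]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify two unseen sentences.
print(classifier.classify(word_features("Restarting the router solved the problem")))
print(classifier.classify(word_features("I suspect the engine is faulty")))
```

With so little data the classifier is only picking up on words like "solved" and "suspect", which is exactly why a large training set and some tolerance for wrong labels are needed in practice.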
If the sentences are well-formed, I would probably experiment with dependency parsing (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.parse.malt.MaltParser-class.html#raw_parse). That gives you a graph of the constituents of a sentence, from which you can tell the relations between the lexical items. Later, you can extract phrases from the output of a dependency parser (http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html#code-cfg2). That could help you extract the direct object of a sentence, or the verb phrase in a sentence.
If you just want to get phrases or "chunks" from a sentence, you can try a chunk parser (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk-module.html). You can also carry out named entity recognition (http://streamhacker.com/2009/02/23/chunk-extraction-with-nltk/). It is usually used to extract instances of places, organizations or people's names, but it could work in your case as well.
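As a small sketch of the chunking idea, the snippet below uses NLTK's RegexpParser to pull noun phrases out of one sentence. The part-of-speech tags are written by hand here so that no tagger model has to be downloaded (in practice they would come from a tagger), and the one-rule NP grammar is only an assumption about what counts as a useful phrase.

```python
import nltk

# Hand-tagged tokens for the example sentence (Penn Treebank tags, assumed).
tagged = [
    ("Restarting", "VBG"), ("the", "DT"), ("set-top", "JJ"), ("box", "NN"),
    ("solved", "VBD"), ("the", "DT"), ("problem", "NN"), (".", "."),
]

# One chunk rule: optional determiner, any adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Collect the words under every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['the set-top box', 'the problem']
```

The extracted phrases are candidates for "what the problem was"; deciding which candidate actually is the problem still needs a further step such as the filtering described next.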
Assuming that you solve the problem of extracting noun/verb phrases from a sentence, you may need to filter them to ease the job of your domain expert (too many phrases could overwhelm a judge). You could carry out a frequency analysis on your phrases and remove the very frequent ones that are not usually related to the problem domain, or compile a whitelist and keep only the phrases that contain a word from a pre-defined set, etc.
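The two filtering ideas can be sketched in plain Python as follows. The function names drop_frequent and keep_whitelisted, the 0.75 threshold, the sample phrases and the whitelist are all hypothetical choices for illustration, not part of NLTK.

```python
from collections import Counter


def drop_frequent(docs, max_fraction):
    """Remove phrases that occur in more than max_fraction of the documents."""
    # Document frequency: count each phrase at most once per document.
    doc_freq = Counter(phrase for doc in docs for phrase in set(doc))
    limit = max_fraction * len(docs)
    return [[p for p in doc if doc_freq[p] <= limit] for doc in docs]


def keep_whitelisted(phrases, whitelist):
    """Keep only phrases that contain at least one whitelisted word."""
    return [p for p in phrases if whitelist & set(p.split())]


# Phrases extracted from four hypothetical reports.
docs = [
    ["the problem", "the set-top box", "the television"],
    ["the problem", "the set-top box"],
    ["the problem", "the engine", "the car"],
    ["the problem", "the car"],
]

# "the problem" appears in every report, so it carries no information.
filtered = drop_frequent(docs, max_fraction=0.75)
print(filtered)

# Keep only phrases that mention a known component.
print(keep_whitelisted(filtered[2], whitelist={"box", "engine"}))
```

The frequency filter needs no domain knowledge but can only remove noise, while the whitelist is more precise at the cost of having to maintain the word list by hand.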