How to extract words from text based on context
I want to extract the relevant words from a text statement provided by the user.
For example, for the question "How many sides are there in a rectangle?", the words should be 'rectangle', 'sides', 'many', and 'how'.
I've since realized that what I'm really aiming to build is an NLP question answering system. For now, though, I only want to extract the required keywords from the question. The domain of the questions is not very large.
I've come across various data mining tools, but I'm not sure whether they would actually be useful for this. They seem either a bit too advanced or not exactly related.
Please let me know whether there is a tool that suits this requirement, or whether I should go ahead and try coding it myself.
Any pointers that you think might help would be appreciated.
Comments (2)
If all you have is the questions themselves, you can try part-of-speech (POS) tagging and named entity recognition (NER). The nouns in particular would be of interest. There are a number of open source tools for this: Brill's POS tagger, LingPipe, OpenNLP, etc. However, if you also have a corpus from the domain you are interested in, you can extract the key words and phrases from it by looking at how much the frequencies of its words and phrases differ from those in some other base corpus. Given a question, you can then look for those key words and phrases in it.
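The corpus-comparison idea can be sketched in a few lines of standard-library Python. This is only a rough illustration under stated assumptions: the function names (`domain_keywords`, `keywords_in_question`), the toy corpora, and the add-one smoothed log-ratio score are all made up for the example, and no POS tagging or NER is performed here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def domain_keywords(domain_docs, base_docs, top_n=5):
    """Rank words by how much more frequent they are in the domain corpus
    than in the base corpus (log-ratio of add-one smoothed frequencies)."""
    domain = Counter(t for d in domain_docs for t in tokenize(d))
    base = Counter(t for d in base_docs for t in tokenize(d))
    vocab = set(domain) | set(base)
    d_total = sum(domain.values()) + len(vocab)
    b_total = sum(base.values()) + len(vocab)
    scores = {
        w: math.log(((domain[w] + 1) / d_total) / ((base[w] + 1) / b_total))
        for w in domain
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


def keywords_in_question(question, keywords):
    """Return the question's tokens that are known keywords, in question order."""
    wanted = set(keywords)
    return [t for t in tokenize(question) if t in wanted]


# Toy corpora, invented purely for illustration.
DOMAIN_DOCS = [
    "A rectangle has four sides and four right angles.",
    "A triangle has three sides.",
    "The sides of a square are equal.",
]
BASE_DOCS = [
    "The weather is nice and there are many people in the park.",
    "How are you today? There is a lot to do.",
    "A dog has four legs and a tail.",
]

KEYWORDS = domain_keywords(DOMAIN_DOCS, BASE_DOCS, top_n=5)
print(keywords_in_question("How many sides are there in a rectangle?", KEYWORDS))
# ['sides', 'rectangle']
```

With corpora this small the ranking is noisy (a function word such as "of" makes the top five simply because it is absent from the base corpus), so a real domain corpus and a large general-purpose base corpus would be needed for useful results.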
Apart from srean's advice to use POS tagging and NER, many people use search engine tools (specifically Lucene, but several others exist) to do question answering. They index a set of documents that should contain the answer, use the question as a query, retrieve a set of documents, and filter those to find the answer. Search engine tools have built-in term weighting.
That's the baseline setup; more advanced systems do all kinds of preprocessing on the question and the documents, including stop word filtering, POS tagging, parsing, NER, genetic algorithms, etc.
See this paper for an example of this setup.
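To make the baseline concrete, here is a minimal in-memory sketch of "index documents, use the question as a query, rank by term weight". It does not use Lucene or imitate its API, and Lucene's own scoring is more elaborate; the `TinyIndex` class, the tiny stop-word list, the smoothed TF-IDF formula, and the sample documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "there", "and", "has", "have"}


def tokenize(text):
    """Lowercase, split into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


class TinyIndex:
    """In-memory index that ranks documents by summed TF-IDF of query terms."""

    def __init__(self, documents):
        self.documents = documents
        self.term_counts = [Counter(tokenize(d)) for d in documents]
        # Number of documents each term occurs in.
        self.doc_freq = Counter(t for c in self.term_counts for t in c)

    def idf(self, term):
        # Smoothed inverse document frequency: rarer terms get larger weights.
        return math.log((1 + len(self.documents)) / (1 + self.doc_freq[term])) + 1.0

    def search(self, question, top_n=3):
        """Return up to top_n documents that share weighted terms with the question."""
        query_terms = set(tokenize(question))
        scored = []
        for doc_id, counts in enumerate(self.term_counts):
            score = sum(counts[t] * self.idf(t) for t in query_terms)
            if score > 0:
                scored.append((score, doc_id))
        # Best score first; ties keep the original document order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self.documents[doc_id] for _, doc_id in scored[:top_n]]


# Sample documents, invented purely for illustration.
DOCS = [
    "A rectangle has four sides and four right angles.",
    "A triangle has three sides.",
    "Lucene is a search engine library.",
]

INDEX = TinyIndex(DOCS)
RESULTS = INDEX.search("How many sides are there in a rectangle?")
for doc in RESULTS:
    print(doc)
```

The rectangle document ranks first because it matches both "sides" and the rarer term "rectangle". Finding the actual answer inside the retrieved documents is a separate filtering step that this sketch leaves out.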