Determining whether a sentence is a question
How can I detect if a search query is in the form of a question?
For example, a customer might search for "how do I track my order" (notice no question mark).
I'm guessing most direct questions would conform to a particular grammar.
Very simple guessing approach:
    START WORDS = [who, what, when, where, why, how, is, can, does, do]

    isQuestion(sentence):
        sentence ends with '?'
        OR sentence starts with one of START WORDS
The START WORDS list could be longer. Since the scope is a website search box, I imagine the list wouldn't need to include too many words.
Is there a library that can do this better than my simple guessing approach? Any improvements on my approach?
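For reference, the guessing approach above could be written in Python roughly as follows. This is only a minimal sketch: the function name `is_question` and the choice to compare the first alphabetic token (rather than a raw string prefix, so that "however" does not match "how") are assumptions, not part of the original pseudocode.

```python
import re

# The ten start words from the question; extend the set as needed.
START_WORDS = {"who", "what", "when", "where", "why", "how", "is", "can", "does", "do"}


def is_question(sentence, start_words=START_WORDS):
    """Return True if the text ends with '?' or its first word is a start word."""
    text = sentence.strip()
    if not text:
        return False
    if text.endswith("?"):
        return True
    # Use the first run of letters as the first word, so "What's ..." yields "what".
    first = re.search(r"[A-Za-z]+", text)
    return first is not None and first.group(0).lower() in start_words
```

Matching whole tokens instead of prefixes avoids false positives such as "however" or "island", at the cost of one regular expression per query.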
See also: How to find out if a sentence is a question (interrogative)?
My answer from that question:
In a syntactic parse of a question (obtained through a toolkit like nltk), the correct structure will be in the form of:
So, using any one of the available syntactic parsers, a tree with an SBARQ node that (optionally) has an embedded SQ is an indicator that the input is a question. The WH+ node (WHNP/WHADVP/WHADJP) contains the question stem (who/what/when/where/why/how) and the SQ holds the inverted phrase.
i.e.:
Of course, having a lot of preceding clauses will cause errors in the parse (which can be worked around), as will really poorly written questions. For example, the title of this post, "How to find out if a sentence is a question?", will have an SBARQ but not an SQ.
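As an illustration of the check described in this answer, here is a small sketch that scans a Penn Treebank style bracketed parse for an SBARQ node and for an SQ nested inside it. It assumes you already have the bracketed string from some parser; the helper names and the example parse are hand-written for illustration and are not the output of any particular tool.

```python
import re

# Hand-written example parse (an assumption, not real parser output).
EXAMPLE_QUESTION = "(SBARQ (WHNP (WP What)) (SQ (VBZ is) (NP (DT the) (NN question))) (. ?))"


def labelled_paths(bracketed):
    """Yield (label, ancestor_labels) for every node in a bracketed parse string."""
    stack = []            # labels of the nodes that are currently open
    expect_label = False  # True right after "(", when the next token is a label
    for token in re.findall(r"\(|\)|[^\s()]+", bracketed):
        if token == "(":
            if expect_label:
                stack.append("")  # the previous bracket had no label (e.g. a root wrapper)
            expect_label = True
        elif token == ")":
            if expect_label:
                expect_label = False  # empty "()" node: nothing was pushed
            elif stack:
                stack.pop()
        elif expect_label:
            yield token, tuple(stack)
            stack.append(token)
            expect_label = False
        # any other token is a word (leaf) and is ignored


def question_shape(bracketed):
    """Return (has_sbarq, has_sq_inside_sbarq) for a bracketed parse."""
    has_sbarq = False
    has_embedded_sq = False
    for label, ancestors in labelled_paths(bracketed):
        if label == "SBARQ":
            has_sbarq = True
        elif label == "SQ" and "SBARQ" in ancestors:
            has_embedded_sq = True
    return has_sbarq, has_embedded_sq
```

Calling `question_shape(EXAMPLE_QUESTION)` reports both an SBARQ and an embedded SQ, while a plain declarative parse rooted at S reports neither.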
You are going to need a much more advanced form of linguistic analysis to pull this off. Need Proof? Okay...
To identify start-words on question sentences, you should go through a large text corpus looking for sentences that end in a ?, and figure out the most frequent start-words you find in those. A few you missed that come to mind include WHICH, AM, ARE, WAS, WERE, MAY, MIGHT, CAN, COULD, WILL, SHALL, WOULD, SHOULD, HAS, HAVE, HAD, and DID. Perhaps also IF to go with WHEN. Also consider IN, AT, TO, FROM, and ON, plus maybe UNDER and OVER. All depends on the sort of query system you have and how much latitude in natural language queries you hope to provide your users with.
Similarly, you should examine all your own queries that people have already made in the same light, finding which of their questions actually do end in a ? to help identify similar ones which do not. That should find a lot of the interrogatives; are imperatives also a possibility?
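The corpus idea in the two paragraphs above could be sketched like this: keep only the sentences (or logged queries) that end in a question mark and count their first words. The function name and the `top_n` parameter are assumptions made for this example; where the sentences come from is up to you.

```python
import re
from collections import Counter


def question_start_words(sentences, top_n=10):
    """Count the first word of every sentence that ends in '?'."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.strip()
        if not text.endswith("?"):
            continue  # only sentences explicitly marked as questions are counted
        first = re.search(r"[A-Za-z]+", text)
        if first:
            counts[first.group(0).lower()] += 1
    return counts.most_common(top_n)
```

The most frequent words returned here are candidates for the start-word list, which you can then review by hand before using them on queries that lack a question mark.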
Depending on how fancy you want to get, you might consider using something like WordNet as a starting point for part-of-speech tagging. It's mostly for synonym sets, including hypernym, hyponym, holonym, and meronym information, but I believe it will have the other information you're looking for as well.
Wikipedia has a couple of articles on question answering and natural language search engines. Both have references you might care to pursue. You might also glance through these PDF papers:
the view from here”
Applied to a Document Retrieval System”.
Lastly, the START Natural Language Question Answering System from MIT seems interesting.
Finding out whether a sentence is a question is not an easy task, because there are many ways people ask questions, and many of them do not follow grammar rules. It is therefore hard to find a good rule set for detection. In such situations, I would go for machine learning and train an algorithm on an annotated text corpus (creating the corpus and selecting a feature set can take some time). Machine-learning-based recognition should give you better recall than the rule-based approach. Here are step-by-step instructions:
Create a feature vector for each sentence (you need both positive and negative examples) based on the extracted information, e.g.,
| Has ? | Verb in second position | Has 5W1H | 5W1H in first position in sentence | ... | Length of sentence | Is a question |
Use the vectors to train a machine learning algorithm, e.g., maximum entropy or SVM (you can use Weka or KNIME).
Use the trained algorithm for question recognition.
If needed (new question examples), repeat the steps.
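The feature-vector step might look roughly like the sketch below. The feature names follow the table header above, but the implementation details are assumptions: in particular, "verb in second position" is approximated with a small hypothetical list of auxiliary verbs, since a real version would need a part-of-speech tagger.

```python
import re

WH_WORDS = {"who", "what", "when", "where", "why", "how"}  # the 5W1H words
# Hypothetical stand-in for a POS tagger: a few common auxiliary verbs.
AUX_VERBS = {"is", "are", "was", "were", "do", "does", "did", "can", "could", "will", "would"}

FEATURE_ORDER = ["has_question_mark", "verb_second", "has_5w1h", "5w1h_first", "length"]


def extract_features(sentence):
    """Map a sentence to the named features from the table header."""
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    return {
        "has_question_mark": int("?" in sentence),
        "verb_second": int(len(tokens) > 1 and tokens[1] in AUX_VERBS),
        "has_5w1h": int(any(t in WH_WORDS for t in tokens)),
        "5w1h_first": int(bool(tokens) and tokens[0] in WH_WORDS),
        "length": len(tokens),
    }


def to_vector(sentence):
    """Return the features as a list in a fixed order, ready for a classifier."""
    features = extract_features(sentence)
    return [features[name] for name in FEATURE_ORDER]
```

Each vector, paired with an "is a question" label from the annotated corpus, forms one training example for whichever classifier you choose.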
In support of JohnFx's answer, it gets even worse. The following are clearly questions:
And then you'll find that users start entering the following kinds of queries:
Is that even a question? Syntactically, no, but it does deserve a reply that could easily be termed an answer. (These kinds of queries may be quite common, depending on your user population.)
Bottom line: if you're not going to handle questions in a special, linguistically sophisticated way (such as constructing a direct answer using natural language generation), recognizing them may not even be interesting. Picking the right keywords from the query may be much more rewarding.
I took a stab at this. My goal was to do something lightweight that wouldn't require additional libraries and would give each developer control over a few necessary elements, such as padding certain characters, allowing negative contractions in the first-word position only, and allowing for common question elements. I created two functions that, when you pass in a value from an Angular 6 HTML page, do a pretty good job for most of my cases.
I don't include "don't" as a starter word because it can be a statement as many times as a question. Don't you think?
Angular HTML:
.ts functions: