Given a document, select a relevant snippet
When I ask a question here, the tooltips for the questions returned by the auto search show the first little bit of each question, but a decent percentage of them don't give any text that is more useful for understanding the question than the title. Does anyone have an idea about how to make a filter to trim out the useless bits of a question?
My first idea is to trim any leading sentences that contain only words in some list (for instance, stop words, plus words from the title, plus words from the SO corpus that correlate very weakly with tags, that is, words that are equally likely to occur in any question regardless of its tags).
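To make the idea concrete, here is a minimal Python sketch of that leading-sentence filter. The tiny `STOP_WORDS` set, the helper names, and the naive sentence splitter are all assumptions for illustration; the tag-correlation word list is left as an optional `extra_ignorable` argument because building it needs corpus statistics that are not shown here.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "the", "i", "is", "to", "have", "how",
              "do", "this", "about", "question"}


def words(text):
    """Return lowercased alphabetic tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def trim_leading(title, body, extra_ignorable=()):
    """Drop leading sentences made up only of ignorable words."""
    ignorable = STOP_WORDS | set(words(title)) | set(extra_ignorable)
    sentences = split_sentences(body)
    start = 0
    # Advance past every leading sentence whose words are all ignorable.
    while start < len(sentences) and all(
        w in ignorable for w in words(sentences[start])
    ):
        start += 1
    return " ".join(sentences[start:])


title = "How to parse JSON"
body = ("I have a question about how to parse JSON. "
        "The parser crashes on nested arrays. How do I fix this?")
snippet = trim_leading(title, body)
```

With this input the first sentence only repeats the title and stop words, so the snippet would start at "The parser crashes on nested arrays."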
Automatic Text Summarization
It sounds like you're interested in automatic text summarization. For a nice overview of the problem, issues involved, and available algorithms, take a look at Das and Martin's paper A Survey on Automatic Text Summarization (2007).
Simple Algorithm
A simple but reasonably effective summarization algorithm is to just select a limited number of sentences from the original text that contain the most frequent content words (i.e., the most frequent ones not including stop list words).
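As a rough illustration of the approach just described, here is a minimal Python sketch. It is not the code of any of the packages mentioned in this answer; the small `STOP_WORDS` set, the naive sentence splitter, and the rule of taking the first sentence that contains each frequent word are assumptions made to keep the example short.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much larger.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "i"}


def words(text):
    """Return lowercased alphabetic tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def summarize(text, num_sentences=1):
    """Pick sentences that contain the most frequent content words."""
    sentences = split_sentences(text)
    tokenized = [set(words(s)) for s in sentences]
    # Count content words (everything that is not on the stop list).
    counts = Counter(
        w for s in sentences for w in words(s) if w not in STOP_WORDS
    )
    chosen = set()
    # Walk the words from most to least frequent and keep the first
    # sentence that contains each one, until enough sentences are chosen.
    for word, _ in counts.most_common():
        if len(chosen) >= num_sentences:
            break
        for i, tokens in enumerate(tokenized):
            if word in tokens:
                chosen.add(i)
                break
    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))


text = "Cats sleep a lot. Cats chase mice. Dogs bark."
one_sentence = summarize(text, 1)
two_sentences = summarize(text, 2)
```

Keeping the selected sentences in document order makes the summary read like an excerpt of the original rather than a ranked list.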
Some open source packages that do summarization using this algorithm are:
Classifier4J (Java)
If you're using Java, you can use Classifier4J's module SimpleSummarizer.
Using the example found here, let's assume the original text is:
As seen in the following snippet, you can easily create a simple one sentence summary:
Using the algorithm above, this will produce
Classifier4J includes a summariser.
NClassifier (C#)
If you're using C#, there's a port of Classifier4J to C# called NClassifier.
Tristan Havelick's Summarizer for NLTK (Python)
There's a work-in-progress Python port of Classifier4J's summarizer built with Python's Natural Language Toolkit (NLTK) available here.