Given a document, select relevant snippets

Posted 2024-09-01 14:45:12


When I ask a question here, the tooltips for the questions returned by the auto search give the first little bit of the question, but a decent percentage of them don't give any text that is any more useful for understanding the question than the title. Does anyone have an idea about how to make a filter to trim out the useless bits of a question?

My first idea is to trim any leading sentences that contain only words from some list (for instance, stop words, plus words from the title, plus words from the SO corpus that have very weak correlation with tags, that is, words that are equally likely to occur in any question regardless of its tags).
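As a rough illustration of that idea, here is a minimal Java sketch; the regex-based sentence and word splitting and the caller-supplied "uninformative" word set are simplifying assumptions, not an established implementation:

import java.util.*;

class SnippetTrimmer {
    // Drop leading sentences whose words all come from an "uninformative" set
    // (stop words, title words, weakly tag-correlated words -- supplied by the caller).
    static String trimLeadingUselessSentences(String body, Set<String> uninformative) {
        // naive sentence split on '.', '!' or '?' followed by whitespace
        String[] sentences = body.split("(?<=[.!?])\\s+");
        int start = 0;
        while (start < sentences.length && isUninformative(sentences[start], uninformative)) {
            start++;
        }
        return String.join(" ", Arrays.copyOfRange(sentences, start, sentences.length));
    }

    static boolean isUninformative(String sentence, Set<String> uninformative) {
        for (String word : sentence.toLowerCase().split("\\W+")) {
            if (!word.isEmpty() && !uninformative.contains(word)) {
                return false;   // found a content word, so keep this sentence
            }
        }
        return true;            // every word is in the uninformative set
    }
}

With the uninformative set built from stop words plus the title's words, a question whose opening sentence merely restates the title would have its tooltip start at the second sentence.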


Comments (1)

夜空下最亮的亮点 2024-09-08 14:45:12


Automatic Text Summarization

It sounds like you're interested in automatic text summarization. For a nice overview of the problem, issues involved, and available algorithms, take a look at Das and Martins's paper A Survey on Automatic Text Summarization (2007).

Simple Algorithm

A simple but reasonably effective summarization algorithm is to just select a limited number of sentences from the original text that contain the most frequent content words (i.e., the most frequent ones not including stop list words).

Summarizer(originalText, maxSummarySize):
   // start with the raw freqs, e.g. [(10,'the'), (3,'language'), (8,'code')...]
   wordFrequencies = getWordCounts(originalText)
   // filter out stop words, e.g. [(3, 'language'), (8, 'code')...]
   contentWordFrequencies = filterStopWords(wordFrequencies)
   // sort by freq & drop counts, e.g. ['code', 'language'...]
   contentWordsSortedByFreq = sortByFreqThenDropFreq(contentWordFrequencies)

   // split the text into sentences
   sentences = getSentences(originalText)

   // select up to maxSummarySize sentences
   setSummarySentences = {}
   foreach word in contentWordsSortedByFreq:
      firstMatchingSentence = search(sentences, word)
      setSummarySentences.add(firstMatchingSentence)
      if setSummarySentences.size() == maxSummarySize:
         break

   // construct the summary out of the selected sentences, preserving original ordering
   summary = ""
   foreach sentence in sentences:
      if sentence in setSummarySentences:
         summary = summary + " " + sentence

   return summary
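
For concreteness, here is a minimal Java sketch of that pseudocode; the regex-based sentence and word splitting and the caller-supplied stop-word set are simplifying assumptions rather than part of the algorithm as stated:

import java.util.*;

class FrequencySummarizer {
    static String summarize(String originalText, int maxSummarySize, Set<String> stopWords) {
        // raw word frequencies, lowercased, with stop words filtered out
        Map<String, Integer> freqs = new HashMap<>();
        for (String w : originalText.toLowerCase().split("\\W+")) {
            if (!w.isEmpty() && !stopWords.contains(w)) {
                freqs.merge(w, 1, Integer::sum);
            }
        }

        // content words sorted by descending frequency
        List<String> contentWords = new ArrayList<>(freqs.keySet());
        contentWords.sort((a, b) -> freqs.get(b) - freqs.get(a));

        // naive sentence split
        String[] sentences = originalText.split("(?<=[.!?])\\s+");

        // pick the first sentence containing each frequent content word,
        // stopping once maxSummarySize sentences have been selected
        Set<String> selected = new HashSet<>();
        for (String word : contentWords) {
            for (String sentence : sentences) {
                if (sentence.toLowerCase().contains(word)) {
                    selected.add(sentence);
                    break;
                }
            }
            if (selected.size() >= maxSummarySize) {
                break;
            }
        }

        // rebuild the summary in the original sentence order
        StringBuilder summary = new StringBuilder();
        for (String sentence : sentences) {
            if (selected.contains(sentence)) {
                if (summary.length() > 0) {
                    summary.append(" ");
                }
                summary.append(sentence);
            }
        }
        return summary.toString();
    }
}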

Some open source packages that do summarization using this algorithm are:

Classifier4J (Java)

If you're using Java, you can use Classifier4J's module SimpleSummarizer.

Using the example found here, let's assume the original text is:

Classifier4J is a java package for working with text. Classifier4J includes a summariser. A Summariser allows the summary of text. A Summariser is really cool. I don't think there are any other java summarisers.

As seen in the following snippet, you can easily create a simple one-sentence summary:

// Request a 1 sentence summary
String summary = summariser.summarise(longOriginalText, 1);

Using the algorithm above, this will produce "Classifier4J includes a summariser."
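
Pieced together, a complete usage sketch might look like the following; the package and class name (net.sf.classifier4j.summariser.SimpleSummariser) are assumed from the Classifier4J distribution and should be checked against the version you install:

// package/class name assumed from the Classifier4J distribution -- verify against your version
import net.sf.classifier4j.summariser.SimpleSummariser;

public class SummariseExample {
    public static void main(String[] args) {
        String longOriginalText = "Classifier4J is a java package for working with text. "
                + "Classifier4J includes a summariser. A Summariser allows the summary of text. "
                + "A Summariser is really cool. I don't think there are any other java summarisers.";

        SimpleSummariser summariser = new SimpleSummariser();

        // Request a 1 sentence summary
        String summary = summariser.summarise(longOriginalText, 1);

        System.out.println(summary);  // expected: "Classifier4J includes a summariser."
    }
}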

NClassifier (C#)

If you're using C#, there's a port of Classifier4J to C# called NClassifier.

Tristan Havelick's Summarizer for NLTK (Python)

There's a work-in-progress Python port of Classifier4J's summarizer built with Python's Natural Language Toolkit (NLTK) available here.
