How do I get similar texts from a lot of pages?

Published 2024-08-07 23:21:21

Get the x most similar texts to one given text, from a large collection of texts.

Maybe converting each page to plain text first would be better.

You should not compare the text to every other text, because that is too slow.

Comments (5)

预谋 2024-08-14 23:21:21

The ability to identify similar documents/pages, whether web pages or more general forms of text or even code, has many practical applications. This topic is well represented in scholarly papers and also in less specialized forums. In spite of this relative wealth of documentation, it can be difficult to find the information and techniques relevant to a particular case.

By describing the specific problem at hand and associated requirements, it may be possible to provide you more guidance. In the meantime the following provides a few general ideas.

Many different functions may be used to measure, in some fashion, the similarity of pages. Selecting one (or possibly several) of these functions depends on various factors, including the amount of time and/or space one can allot to the problem, and also on the level of tolerance desired for noise.

Some of the simpler metrics are:

  • length of the longest common sequence of words
  • number of common words
  • number of common sequences of words of more than n words
  • number of common words for the top n most frequent words within each document.
  • length of the document

Some of the metrics above work better when normalized (for example, to avoid favoring long pages which, through their sheer size, have more chances of sharing words with other pages).
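Two of the simpler metrics above can be sketched in a few lines of stdlib-only Python; the whitespace tokenization via `split()` is a deliberate simplification, and the Jaccard-style normalization is one of several possible choices:

```python
def common_word_count(a: str, b: str) -> int:
    # raw metric: how many distinct words the two texts share
    return len(set(a.lower().split()) & set(b.lower().split()))

def normalized_common_words(a: str, b: str) -> float:
    # normalized variant: shared words divided by total distinct words,
    # so long pages are not favored merely for containing more words
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)
```

The normalized score is bounded in [0, 1], which makes it easier to combine with other metrics via weighting coefficients.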

More complicated and/or computationally expensive measurements are:

  • Edit distance (which is in fact a generic term as there are many ways to measure the Edit distance. In general, the idea is to measure how many [editing] operations it would take to convert one text to the other.)
  • Algorithms derived from the Ratcliff/Obershelp algorithm (but counting words rather than letters)
  • Linear algebra-based measurements
  • Statistical methods such as Bayesian filters

In general, we can distinguish measurements/algorithms where most of the calculation can be done once per document, followed by an extra pass aimed at comparing or combining these measurements (with relatively little extra computation), as opposed to algorithms that must process the documents to be compared in pairs.
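The first family (per-document work done once, cheap comparison pass afterwards) can be sketched as follows; the word-set feature and the helper names are illustrative choices, not from any particular library:

```python
def build_features(docs):
    # expensive step, done once per document: extract each doc's word set
    return [set(d.lower().split()) for d in docs]

def most_similar(query, docs, features):
    # cheap pass: score the query against each precomputed feature
    q = set(query.lower().split())
    scores = [len(q & f) / (len(q | f) or 1) for f in features]
    return docs[max(range(len(docs)), key=scores.__getitem__)]
```

A pairwise algorithm such as edit distance, by contrast, would have to re-examine both texts for every one of the O(n²) document pairs.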

Before choosing one (or indeed several such measures, along with some weighting coefficients), it is important to consider additional factors beyond the similarity measurement per se. For example, it may be beneficial to...

  • normalize the text in some fashion (in the case of web pages, in particular, similar page contents or similar paragraphs are made to look less similar by all the "decorum" associated with the page: headers, footers, advertisement panels, different markup, etc.)
  • exploit markup (e.g., giving more weight to similarities found in the title or in tables than to similarities found in plain text)
  • identify and eliminate domain-related (or even generally known) expressions. For example, two completely different documents may appear similar if they have in common two "boilerplate" paragraphs pertaining to some legal disclaimer or some general-purpose description, not truly associated with the essence of each document's content.

分分钟 2024-08-14 23:21:21

Tokenize the texts, remove stop words, and arrange them in term vectors. Calculate tf-idf. Arrange all vectors in a matrix and calculate the distances between them to find similar docs, using, for example, the Jaccard index.
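A stdlib-only sketch of that pipeline; the tiny stop-word list is purely illustrative, and cosine similarity is substituted here for the Jaccard index since it applies directly to weighted tf-idf vectors:

```python
import math
from collections import Counter

STOP = {"the", "a", "an", "and", "of", "to", "is", "in"}  # illustrative stand-in

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP]

def tfidf_vectors(docs):
    tokens = [tokenize(d) for d in docs]
    n = len(docs)
    # document frequency: in how many docs each term appears
    df = Counter(t for ts in tokens for t in set(ts))
    vecs = []
    for ts in tokens:
        tf = Counter(ts)
        # smoothed idf keeps every weight positive
        vecs.append({t: (c / len(ts)) * (math.log((1 + n) / (1 + df[t])) + 1)
                     for t, c in tf.items()})
    return vecs

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In practice a library such as scikit-learn's `TfidfVectorizer` would replace the hand-rolled vectorization.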

愿与i 2024-08-14 23:21:21

All depends on what you mean by "similar". If you mean "about the same subject", looking for matching N-grams usually works pretty well. For example, just make a map from trigrams to the texts that contain them, and put all trigrams from all of your texts into that map. Then when you get your text to be matched, look up all its trigrams in your map and pick the most frequent texts that come back (perhaps with some normalization by length).
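The trigram-map idea above can be sketched like this (word trigrams rather than character trigrams; the length normalization is the optional step the answer mentions):

```python
from collections import Counter, defaultdict

def trigrams(text):
    # all consecutive 3-word sequences of the text
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def build_index(texts):
    # map each trigram to the ids of the texts containing it
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for tri in trigrams(text):
            index[tri].add(doc_id)
    return index

def best_match(query, texts, index):
    # count how often each indexed text comes back for the query's trigrams
    hits = Counter()
    for tri in trigrams(query):
        for doc_id in index.get(tri, ()):
            hits[doc_id] += 1
    if not hits:
        return None
    # normalize by length so long texts are not automatically favored
    doc_id = max(hits, key=lambda d: hits[d] / max(len(trigrams(texts[d])), 1))
    return texts[doc_id]
```

Note that the index is built once; matching a new text is then just dictionary lookups, which fits the "don't compare against every text" requirement of the question.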

马蹄踏│碎落叶 2024-08-14 23:21:21

I don't know what you mean by similar, but perhaps you ought to load your texts into a search system like Lucene and pose your 'one text' to it as a query. Lucene does pre-index the texts so it can quickly find the most similar ones (by its lights) at query-time, as you asked.

北渚 2024-08-14 23:21:21

You will have to define a function to measure the "difference" between two pages. I can imagine a variety of such functions, one of which you have to choose for your domain:

  • Difference of Keyword Sets - You can prune the document of the most common words in the dictionary, and then end up with a list of unique keywords per document. The difference function would then calculate the difference as the difference of the sets of keywords per document.

  • Difference of Text - Calculate each distance based upon the number of edits it takes to turn one doc into another using a text diffing algorithm (see Text Difference Algorithm).

Once you have a difference function, simply calculate the difference of your current doc with every other doc, then return the other doc that is closest.
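A minimal sketch of the keyword-set variant of that scheme; the `COMMON` word list is an illustrative stand-in for a real dictionary of frequent words, and the symmetric-difference size serves as the difference function:

```python
COMMON = {"the", "a", "an", "and", "or", "of", "in", "to", "for", "it"}  # illustrative

def keywords(doc):
    # prune the most common dictionary words, keep the rest as the keyword set
    return {w for w in doc.lower().split() if w not in COMMON}

def closest(current, others):
    # difference function: size of the symmetric difference of keyword sets;
    # return the doc whose keyword set differs least from the current one
    cur = keywords(current)
    return min(others, key=lambda d: len(cur ^ keywords(d)))
```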

If you need to do this a lot and you have a lot of documents, then the problem becomes a bit more difficult.
