Computing the similarity between two lists


EDIT:
As everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other.

E.g.,

1,7,4,5,8,9
1,7,5,4,9,6

What is a good measure of similarity between these two lists, given that order matters? For example, the similarity should be penalized because 4 and 5 are swapped between the two lists.

I have two systems: a state-of-the-art system and one that I implemented. Given a query, both systems return a ranked list of documents. Now, I want to compare the similarity between my system and the "state of the art system" in order to measure the correctness of my system. Please note that the order of documents is important, as we are talking about ranked systems.
Does anyone know of any measures that can help me find the similarity between these two lists?

7 Answers

浮生未歇 2025-01-14 08:45:07


The DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually good measures for ranked lists.

DCG gives the full gain for a relevant document if it is ranked first, and the gain decreases as the rank decreases.

Using DCG/nDCG to evaluate your system against the SOA baseline:

Note: if you treat all results returned by the "state of the art system" as relevant, then your system is identical to the state of the art exactly when it produces the same ranking, as measured by DCG/nDCG.

Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)

To further refine it, you can assign a relevance grade [so relevance is not binary], determined by how each document is ranked in the state of the art. For example, rel_i = 1/log2(1+i) for the document at rank i in the state-of-the-art system.

If the value returned by this evaluation function is close to 1, your system is very similar to the baseline.

Example:

mySystem = [1,2,5,4,6,7]
stateOfTheArt = [1,2,4,5,6,9]

First, score each document according to the state-of-the-art system [using the formula above]:

doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0
doc8 = 0
doc9 = 0.3562071871080222

Now calculate DCG(stateOfTheArt) using the relevance grades above [note that relevance is not binary here], and get DCG(stateOfTheArt) = 2.1100933062283396.

Next, calculate it for your system using the same relevance weights and get DCG(mySystem) = 1.9784040064803783.

Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
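
For reference, a minimal sketch of this computation (Python; using log base 2, which is what reproduces the numbers above):

    from math import log2

    def dcg(ranking, relevance):
        """DCG with the standard log2 position discount."""
        return sum(relevance.get(doc, 0.0) / log2(pos + 1)
                   for pos, doc in enumerate(ranking, start=1))

    state_of_the_art = [1, 2, 4, 5, 6, 9]
    my_system = [1, 2, 5, 4, 6, 7]

    # Relevance grades derived from the baseline: rel = 1/log2(1 + rank).
    relevance = {doc: 1.0 / log2(1 + rank)
                 for rank, doc in enumerate(state_of_the_art, start=1)}

    print(dcg(state_of_the_art, relevance))  # 2.1100933062283396
    print(dcg(my_system, relevance))         # 1.9784040064803783
    print(dcg(my_system, relevance) / dcg(state_of_the_art, relevance))
    # 0.9375907693942939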

只涨不跌 2025-01-14 08:45:07


Kendall's tau is the metric you want. It measures the number of pairwise inversions between the lists. Spearman's footrule does something similar, but measures displacement distance rather than inversions. Both are designed for the task at hand: measuring the difference between two rank-ordered lists.
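
A quick sketch of both (Python, using scipy for tau; restricting to the items both lists share is an assumption on my part, since both measures need paired rankings):

    from scipy.stats import kendalltau

    list1 = [1, 7, 4, 5, 8, 9]
    list2 = [1, 7, 5, 4, 9, 6]

    # Both measures need paired rankings, so keep only shared items.
    s2 = set(list2)
    common = [x for x in list1 if x in s2]
    ranks1 = [list1.index(x) for x in common]
    ranks2 = [list2.index(x) for x in common]

    tau, p_value = kendalltau(ranks1, ranks2)
    print(tau)  # 1.0 = identical order, -1.0 = reversed

    # Spearman's footrule: total positional displacement of shared items.
    footrule = sum(abs(list1.index(x) - list2.index(x)) for x in common)
    print(footrule)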

冰葑 2025-01-14 08:45:07


Is the list of documents exhaustive? That is, is every document ranked by system 1 also ranked by system 2? If so, Spearman's rho may serve your purposes. When the systems don't share the same documents, the big question is how to interpret the result. I don't think there is a measure that answers that question, although some may implement an implicit answer to it.
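
If the lists are exhaustive, a sketch along these lines computes rho (Python with scipy; the two rankings below are hypothetical, chosen only so that both systems rank the same six documents):

    from scipy.stats import spearmanr

    # Hypothetical exhaustive case: both systems rank the same documents.
    system1 = [1, 2, 4, 5, 6, 9]
    system2 = [1, 2, 5, 4, 9, 6]

    docs = system1
    ranks1 = [system1.index(d) + 1 for d in docs]
    ranks2 = [system2.index(d) + 1 for d in docs]

    rho, p_value = spearmanr(ranks1, ranks2)
    print(rho)  # 1.0 only if the two orderings agree exactly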

玩套路吗 2025-01-14 08:45:07


As you said, you want to compute how similar one list is to the other. Simplistically, you can start by counting the number of inversions. There is an O(N log N) divide-and-conquer approach for this. It is a very simple way to measure the "similarity" between two lists.

For example, to compare how "similar" the music tastes of two people on a music website are, you take their rankings of a common set of songs and count the number of inversions between them. The lower the count, the more "similar" their tastes.

Since you are already treating the "state of the art system" as the benchmark of correctness, counting inversions should give you a basic measure of the "similarity" of your ranking. Of course this is just a starting point, but you can build on it, e.g. by deciding how strictly to weight the "inversion gap".

    D1 D2 D3 D4 D5 D6
    -----------------
R1: 1, 7, 4, 5, 8, 9  [Rankings from 'state of the art' system]
R2: 1, 7, 5, 4, 9, 6  [ your Rankings]

Since the rankings are in document order, you can write your own comparator function based on R1 (the ranking of the "state of the art" system) and count the inversions of R2 relative to that comparator.

You can "penalize" the similarity for each inversion found: i < j but R2[i] >' R2[j] (where >' is your own comparator), as in the sketch below.
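
A sketch of this approach (Python; items that R2 ranks but R1 does not, like document 6 above, are an edge case the answer leaves open; here they are arbitrarily placed after all R1-ranked items):

    def count_inversions(seq):
        """Return (sorted seq, #pairs i<j with seq[i]>seq[j]) via merge sort, O(N log N)."""
        if len(seq) <= 1:
            return seq, 0
        mid = len(seq) // 2
        left, inv_left = count_inversions(seq[:mid])
        right, inv_right = count_inversions(seq[mid:])
        merged, inversions = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] jumps ahead of every remaining left element:
                # each such pair is one inversion.
                inversions += len(left) - i
                merged.append(right[j])
                j += 1
        merged += left[i:] + right[j:]
        return merged, inversions

    R1 = [1, 7, 4, 5, 8, 9]  # "state of the art" ranking
    R2 = [1, 7, 5, 4, 9, 6]  # your ranking

    # The comparator based on R1: map each item of R2 to its R1 rank, so
    # ordinary inversions in `key` are exactly inversions w.r.t. R1's order.
    pos = {doc: rank for rank, doc in enumerate(R1)}
    key = [pos.get(doc, len(R1)) for doc in R2]  # unranked items go last
    _, inversions = count_inversions(key)
    print(inversions)  # 1, from the swapped pair (5, 4)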

Links you may find useful:
Link1
Link2
Link3

帅气称霸 2025-01-14 08:45:07


I actually know four different measures for that purpose.

Three have already been mentioned:

  • NDCG
  • Kendall's Tau
  • Spearman's Rho

But if you have more than two rankings to compare, use Kendall's W.
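
Kendall's W is short enough to compute directly; a minimal sketch under the textbook definition (m judges each assign ranks 1..n to the same n items, assuming no ties):

    def kendalls_w(rankings):
        """rankings: m rank vectors over the same n items. Returns W in [0, 1]."""
        m, n = len(rankings), len(rankings[0])
        # Rank sum each item receives across all judges.
        totals = [sum(r[i] for r in rankings) for i in range(n)]
        mean = m * (n + 1) / 2
        s = sum((t - mean) ** 2 for t in totals)  # squared deviations
        return 12 * s / (m ** 2 * (n ** 3 - n))

    # Three hypothetical judges ranking the same four items:
    print(kendalls_w([[1, 2, 3, 4],
                      [1, 3, 2, 4],
                      [2, 1, 3, 4]]))  # 1.0 would mean perfect agreement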

久伴你 2025-01-14 08:45:07


In addition to what has already been said, I would like to point you to the following excellent paper: W. Webber et al., "A Similarity Measure for Indefinite Rankings" (2010). Besides containing a good review of existing measures (such as the above-mentioned Kendall's tau and Spearman's footrule), the authors propose an intuitively appealing probabilistic measure that is applicable to result lists of varying length and to cases where not all items occur in both lists. Roughly speaking, it is parameterized by a "persistence" probability p that a user scans item k+1 after having inspected item k (rather than abandoning). Rank-Biased Overlap (RBO) is the expected overlap ratio of the results at the point where the user stops reading.

The implementation of RBO is slightly more involved; you can take a peek at an implementation in Apache Pig here.
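
To get a feel for the measure, here is a minimal sketch of the truncated (prefix-only) RBO sum; the paper's tail extrapolation is omitted, so this is a lower bound for finite lists, with p as the persistence parameter:

    def rbo_prefix(list1, list2, p=0.9):
        """Truncated RBO: (1-p) * sum_d p^(d-1) * (overlap of top-d prefixes)/d."""
        depth = min(len(list1), len(list2))
        seen1, seen2 = set(), set()
        score = 0.0
        for d in range(1, depth + 1):
            seen1.add(list1[d - 1])
            seen2.add(list2[d - 1])
            score += p ** (d - 1) * len(seen1 & seen2) / d
        return (1 - p) * score

    print(rbo_prefix([1, 7, 4, 5, 8, 9], [1, 7, 5, 4, 9, 6], p=0.9))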

Another simple measure is cosine similarity: the cosine between two vectors whose dimensions correspond to items, with inverse ranks as weights. However, it doesn't gracefully handle items that occur in only one of the lists (see the implementation in the link above).

  1. For each item i in list 1, let h_1(i) = 1/rank_1(i). For each item i in list 2 not occurring in list 1, let h_1(i) = 0. Do the same for h_2 with respect to list 2.
  2. Compute v12 = sum_i h_1(i) * h_2(i); v11 = sum_i h_1(i) * h_1(i); v22 = sum_i h_2(i) * h_2(i)
  3. Return v12 / sqrt(v11 * v22)

For your example, this gives a value of 0.7252747.
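
A direct transcription of the three steps (Python; `rank` is the 1-based position of an item in each list):

    from math import sqrt

    def rank_cosine(list1, list2):
        # Step 1: inverse-rank weights; items absent from a list weigh 0.
        h1 = {item: 1.0 / rank for rank, item in enumerate(list1, start=1)}
        h2 = {item: 1.0 / rank for rank, item in enumerate(list2, start=1)}
        # Step 2: the three inner products.
        v12 = sum(w * h2.get(item, 0.0) for item, w in h1.items())
        v11 = sum(w * w for w in h1.values())
        v22 = sum(w * w for w in h2.values())
        # Step 3: the cosine.
        return v12 / sqrt(v11 * v22)

    print(rank_cosine([1, 7, 4, 5, 8, 9], [1, 7, 5, 4, 9, 6]))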

Beyond your immediate question, let me give you some practical advice. Unless your "production system" baseline is perfect (or we are dealing with a gold set), it is almost always better to compare a quality measure (such as the above-mentioned nDCG) than a similarity; a new ranking will sometimes be better and sometimes worse than the baseline, and you want to know whether the former happens more often than the latter. Secondly, similarity measures are not trivial to interpret on an absolute scale. For example, if you get a similarity score of, say, 0.72, does that mean the rankings are really similar or significantly different? Similarity measures are more helpful for relative statements, e.g. that a new ranking method 1 is closer to production than another new ranking method 2.

动听の歌 2025-01-14 08:45:07


I suppose you are talking about comparing two information retrieval systems, which, trust me, is not trivial. It is a complex computer science problem.

For measuring relevance or doing a kind of A/B testing, you need a couple of things:

  1. A competitor to measure relevance against. As you have two systems, this prerequisite is met.

  2. You need to manually rate the results. You can ask your colleagues to rate query/URL pairs for popular queries; for the holes (i.e., query/URL pairs that were not rated) you can fall back on a dynamic ranking function built with a "Learning to Rank" algorithm: http://en.wikipedia.org/wiki/Learning_to_rank. Don't be surprised by that, but it's true (please read the Google/Bing example below).

Google and Bing are competitors in the horizontal search market. These search engines employ human judges around the world and invest millions in them to rate their results for queries. So for each query, generally the top 3 or top 5 results are rated. Based on these ratings they may use a metric like NDCG (Normalized Discounted Cumulative Gain), which is one of the finest and most popular metrics.

According to Wikipedia:

Discounted cumulative gain (DCG) is a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.
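
In symbols, one common form of the definition (there are variants in how the discount is applied):

    DCG_p  = sum_{i=1..p} rel_i / log2(i + 1)
    nDCG_p = DCG_p / IDCG_p    (IDCG_p = DCG of the ideal ordering)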

Wikipedia explains NDCG very well. It is a short article; please go through it.
