News item (topic) similarity algorithm

Posted 2024-07-16 10:16:14

I want to determine the similarity of the content of two news items, similar to Google News but different in the sense that I want to be able to determine what the basic topics are and then determine which topics are related.

So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.

If you can just throw around keywords like k-nearest neighbours, with a little explanation of why they work (if you can), I will do the rest of the research and tweak the algorithm. I'm just looking for a place to get started, since I know someone out there must have tried something similar before.

Comments (2)

甜是你 2024-07-23 10:16:15

First thoughts:

  • Toss away noise words (and, you, is, the, some, ...).
  • Count all other words and sort by quantity.
  • For each word that appears in both articles, add to a score based on the sum (or product, or some other formula) of its counts in the two articles.
  • The score represents the similarity.
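The steps above could be sketched roughly as follows. This is only an illustration: the noise-word list, the function names and the sample sentences are made up for the example, and the product of counts is just one of the formulas mentioned above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny noise-word list; a real one would be much longer.
NOISE_WORDS = {"and", "you", "is", "the", "some", "a", "of", "in", "to", "about"}

def word_counts(text):
    """Count every non-noise word in the text, case-insensitively."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in NOISE_WORDS)

def overlap_score(text_a, text_b):
    """Sum, over the words shared by both articles, the product of their counts."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[w] * counts_b[w] for w in shared)

# Made-up sample texts, only to exercise the functions.
saddam = "Saddam Hussein ruled Iraq. Iraq was invaded in 2003."
rumsfeld = "Donald Rumsfeld had business dealings in Iraq."
gates = "Bill Gates founded Microsoft."

print(overlap_score(saddam, rumsfeld))  # shared word: "iraq"
print(overlap_score(saddam, gates))     # no shared words
```

A higher score means more shared vocabulary. Using the product rather than the sum rewards words that are frequent in both articles, not just in one of them.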

It seems to me that an article primarily about Donald Rumsfeld would contain those two words quite a lot, which is why I weight words by how often they appear in the article.

However, there may be one article that mentions Warren Buffett many times and Bill Gates once, and another that mentions both Bill Gates and Microsoft many times. The correlation there would be minimal.

Based on your comment:

"So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq."

That wouldn't be the case unless the Saddam article also mentioned Iraq (or Donald).

That's where I'd start, although I can already see potential holes in the theory (an article about Bill Gates would match closely with an article about Bill Clinton if their first names are mentioned a lot). This may well be taken care of by all the other words (Microsoft for one Bill, Hillary for the other).

I'd perhaps give it a test run before trying to introduce word-proximity functionality since that's going to make it very complicated (maybe unnecessarily).

One other possible improvement would be maintaining 'hard' associations (like always adding the word Afghanistan to articles with Osama bin Laden in them). But again, that requires extra maintenance for possibly dubious value since articles about Osama would almost certainly mention Afghanistan as well.
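A 'hard' association table could be as simple as the sketch below. The table contents and the function name are hypothetical, and it assumes each article has already been reduced to a dict of lowercase word counts.

```python
# Hypothetical hand-maintained table: trigger word -> words to add to the article.
HARD_ASSOCIATIONS = {"osama": ["afghanistan"]}

def add_hard_associations(counts):
    """Return a copy of the word counts with associated words added if absent."""
    result = dict(counts)
    for trigger, extras in HARD_ASSOCIATIONS.items():
        if trigger in counts:
            for word in extras:
                # Only add the word when the article does not already contain it.
                result.setdefault(word, 1)
    return result

print(add_hard_associations({"osama": 3, "laden": 3}))
```

The maintenance cost mentioned above is the table itself: somebody has to keep it up to date by hand.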

指尖微凉心微凉 2024-07-23 10:16:15

At the moment I am thinking of something like this.

Each non-noise word is a dimension. Each article is represented by a vector in which words that don't appear get a value of zero and words that do appear get a value equal to the number of times they appear divided by the total number of words on the page. Then I can take the Euclidean distance between any two points in this space as a measure of the similarity of the two articles: the smaller the distance, the more similar they are.
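A minimal sketch of that representation, using sparse dicts so that absent words are implicitly zero. The noise-word list and function names are placeholders for the example, and "total words on the page" is taken here to include noise words, which is one possible reading.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny noise-word list.
NOISE_WORDS = {"and", "you", "is", "the", "some", "a", "of", "in", "to"}

def term_frequencies(text):
    """Map each non-noise word to its count divided by the total words on the page."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)  # assumption: the total includes noise words
    counts = Counter(w for w in words if w not in NOISE_WORDS)
    return {w: c / total for w, c in counts.items()} if total else {}

def euclidean_distance(vec_a, vec_b):
    """Distance between two sparse vectors; a word missing from one counts as zero."""
    dims = vec_a.keys() | vec_b.keys()
    return math.sqrt(sum((vec_a.get(d, 0.0) - vec_b.get(d, 0.0)) ** 2 for d in dims))

a = term_frequencies("iraq iraq war the")
b = term_frequencies("iraq oil")
print(euclidean_distance(a, b))
```

Dividing by the page length keeps a long article from looking different from a short one merely because it has more words.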

The next step would be to determine clusters of articles, and then determine a central point for each cluster. The Euclidean distance between the central points of any two clusters then gives the similarity of the two topics.
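Assuming the clusters have already been found somehow (k-means would be one common choice, but that is an assumption, not something settled here), the central points and the distance between them could be computed along these lines. The cluster contents below are invented for the example.

```python
import math

def centroid(vectors):
    """Component-wise mean of sparse vectors (dicts of word -> weight)."""
    if not vectors:
        raise ValueError("cannot take the centroid of an empty cluster")
    totals = {}
    for vec in vectors:
        for word, weight in vec.items():
            totals[word] = totals.get(word, 0.0) + weight
    return {word: total / len(vectors) for word, total in totals.items()}

def euclidean_distance(vec_a, vec_b):
    """Distance between two sparse vectors; a word missing from one counts as zero."""
    dims = vec_a.keys() | vec_b.keys()
    return math.sqrt(sum((vec_a.get(d, 0.0) - vec_b.get(d, 0.0)) ** 2 for d in dims))

# Hypothetical clusters of article vectors, made up for the example.
iraq_cluster = [{"iraq": 0.4, "saddam": 0.2}, {"iraq": 0.2, "rumsfeld": 0.2}]
tech_cluster = [{"microsoft": 0.5, "gates": 0.1}, {"microsoft": 0.3, "gates": 0.3}]

topic_distance = euclidean_distance(centroid(iraq_cluster), centroid(tech_cluster))
print(topic_distance)
```

A small distance between two centroids would then suggest related topics, a large one unrelated topics.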

Baaah, I think by typing it out I solved my own problem. Of course, only at a very high level; I am sure that when I get down to it I will find problems ... the devil is always in the detail.

But comments and improvements are still highly appreciated.
