How does Twitter's trending topics algorithm decide which words to extract from tweets?

Posted 2024-08-16 20:31:28

I saw this question, which focuses on the "Britney Spears" problem, but I have a slightly different question. How does the algorithm determine which words or phrases need to be ranked? For instance, if I send out a tweet that says "Michael Jackson died", how does it know to pull out "Michael Jackson" but not "died"?

Or suppose that Alec Baldwin and Stephen Baldwin were both in the news that day and were therefore mentioned in a lot of tweets. How would it know to treat the two names as distinct instead of just pulling out "Baldwin"?

Done naively, I could see this problem being NP-complete (you'd have to compare every potential phrase in the tweet with every potential phrase in everyone else's tweets).

Comments (2)

聽兲甴掵 2024-08-23 20:31:28

A general solution to this problem is "term frequency, inverse document frequency" (tf-idf).

It is a statistical approach that scores a word or term as more relevant than others when it appears often in one document but rarely across the collection as a whole. In this case, the name "Michael Jackson" may have a very low background frequency compared with a common English word such as "died".
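To make the weighting concrete, here is a minimal sketch of tf-idf using only the Python standard library. It treats each tweet as a "document", splits on whitespace, and uses the plain `log(N / df)` form of idf; those choices, and the function name `tf_idf`, are assumptions for illustration, not a description of what Twitter actually runs.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = occurrences of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


if __name__ == "__main__":
    # Hypothetical sample tweets, made up for this example.
    tweets = [
        "michael jackson died",
        "my phone battery died",
        "the plant died",
        "he died laughing",
    ]
    for term, score in sorted(tf_idf(tweets)[0].items(), key=lambda kv: -kv[1]):
        print(f"{term}: {score:.3f}")
```

Because "died" occurs in every sample tweet, its idf is `log(4 / 4) = 0` and it drops out, while "michael" and "jackson" keep a positive score. A real system would presumably compare against a much larger background corpus and smooth the idf term.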

As for Alec Baldwin vs. Stephen Baldwin: these would be identified as separate entities during part-of-speech tagging, since each full name would be tagged as an individual proper noun.
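The idea of keeping full names together can be sketched without a real tagger. The example below is not part-of-speech tagging: it swaps in a much cruder heuristic (runs of consecutive capitalized words) purely to show why the two Baldwins end up as separate keys. The function names `capitalized_phrases` and `count_names` are hypothetical, and the heuristic will misfire on sentence-initial words and all-lowercase tweets.

```python
import re
from collections import Counter

# One or more capitalized words in a row, e.g. "Alec Baldwin".
# ASSUMPTION: capitalization stands in for a proper-noun tag here.
_NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def capitalized_phrases(text):
    """Return each maximal run of capitalized words as one phrase."""
    return _NAME_RE.findall(text)


def count_names(tweets):
    """Count how many times each capitalized phrase occurs across tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(capitalized_phrases(tweet))
    return counts


if __name__ == "__main__":
    # Hypothetical sample tweets, made up for this example.
    sample = [
        "I heard Alec Baldwin and Stephen Baldwin were on TV",
        "wow, Alec Baldwin was great tonight",
    ]
    print(count_names(sample))
```

Since each run is kept whole, the counter tracks "Alec Baldwin" and "Stephen Baldwin" as distinct phrases, and the bare surname never becomes a key of its own. An actual part-of-speech tagger or named-entity recognizer would handle far more cases than this regular expression.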

恋竹姑娘 2024-08-23 20:31:28

I believe it looks for common sets of words. Also, it appears that they are referencing http://www.whatthetrend.com/

In addition to this, there might also be a small amount of human curation involved.
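One way to read "common sets of words" is as shared n-grams: short word sequences that show up in many different tweets. The sketch below is an assumption about what that could look like, not the method Twitter or whatthetrend.com uses; `ngrams`, `common_phrases`, `max_n`, and `min_tweets` are hypothetical names. It also suggests the search need not be a pairwise comparison of tweets, because each tweet is scanned once and its phrases are tallied in a hash table.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def common_phrases(tweets, max_n=3, min_tweets=2):
    """Map each phrase of up to max_n words to the number of tweets
    containing it, keeping only phrases found in at least min_tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        seen = set()  # count each phrase at most once per tweet
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        counts.update(seen)
    return {phrase: c for phrase, c in counts.items() if c >= min_tweets}


if __name__ == "__main__":
    # Hypothetical sample tweets, made up for this example.
    sample = [
        "michael jackson died",
        "rip michael jackson",
        "michael jackson was a legend",
    ]
    print(common_phrases(sample))
```

With a fixed `max_n`, the work grows roughly linearly with the total number of words. Ranking the surviving phrases, for example by a tf-idf style score against background usage, would be a separate step.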
