Naively classifying new trends in incoming data

Posted 2024-08-19 22:08:43


How do news outlets like Google News automatically classify and rank documents about emerging topics, like "Obama's 2011 budget"?

I've got a pile of articles tagged with baseball data like player names and their relevance to the article (thanks, OpenCalais), and would love to create a Google News-style interface that ranks and displays new posts as they come in, especially for emerging topics. I suppose a naive Bayes classifier could be trained with some static categories, but this doesn't really allow for tracking trends like "this player was just traded to this team, and these other players were also involved."
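To make the limitation concrete, here is a toy multinomial naive Bayes over two static categories (the category names and training snippets are invented for illustration; a real setup would train on the OpenCalais-tagged articles). Once trained, it can only route a document into one of its fixed categories, so a brand-new development like a specific trade has no class of its own:

```python
from collections import Counter, defaultdict
import math

class NaiveBayes:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs):  # docs: list of (text, label) pairs
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for text, label in docs:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.label_counts[label] += 1
            self.vocab.update(words)
        return self

    def predict(self, text):
        # Score each label by log P(label) + sum of log P(word | label).
        def log_prob(label):
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(self.label_counts[label])
            for w in text.lower().split():
                score += math.log((self.word_counts[label][w] + 1) / total)
            return score
        return max(self.label_counts, key=log_prob)

# Invented training snippets for two static categories.
train = [
    ("pitcher strikeout inning fastball", "pitching"),
    ("home run batter slugging average", "hitting"),
]
clf = NaiveBayes().fit(train)
print(clf.predict("the batter hit a home run"))  # → "hitting"
```

Every incoming document lands in "pitching" or "hitting" no matter what; nothing here can surface "Player X was traded" as a topic in its own right, which is exactly the gap the question is about.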


Comments (2)

梦晓ヶ微光ヅ倾城 2024-08-26 22:08:43


No doubt, Google News may use other tricks (or even a combination thereof), but one relatively cheap trick, computationally, to infer topics from free-text would exploit the NLP notion that a word gets its meaning only when connected to other words.
An algorithm capable of discovering new topic categories from multiple documents could be outlined as follows:

  • POS (part-of-speech) tag the text
    We probably want to focus more on nouns and maybe even more so on named entities (such as Obama or New England)
  • Normalize the text
    In particular replace inflected words by their common stem. Maybe even replace some adjectives by a corresponding Named Entity (ex: Parisian ==> Paris, legal ==> law)
    Also, remove noise words and noise expressions.
  • Identify some words from a manually maintained list of "current / recurring hot words" (Superbowl, Elections, scandal...)
    This can be used in subsequent steps to provide more weight to some N-grams
  • Enumerate all N-grams found in each document (where N ranges from 1 to, say, 4 or 5)
    Be sure to count, separately, the number of occurrences of each N-gram within a given document and the number of documents which cite a given N-gram
  • The most frequently cited N-grams (i.e. the ones cited in the most documents) are probably the Topics.
  • Identify the existing topics (from a list of known topics)
  • [optionally] Manually review the new topics
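The steps above can be sketched in a few lines of Python. This is a minimal illustration: the noise-word and hot-word lists are placeholders, and the crude tokenizer stands in for real POS tagging, named-entity recognition, and stemming:

```python
from collections import Counter

# Placeholder lists; a real system would maintain these over time.
NOISE_WORDS = {"the", "a", "an", "of", "to", "in", "and", "was", "on"}
HOT_WORDS = {"superbowl", "elections", "scandal"}
HOT_WORD_BOOST = 2  # extra weight for N-grams containing a hot word

def normalize(text):
    """Lowercase, tokenize, and drop noise words.
    A real pipeline would also POS-tag (keeping nouns / named entities)
    and replace inflected words by their stem here."""
    tokens = (w.strip(".,!?\"'").lower() for w in text.split())
    return [t for t in tokens if t and t not in NOISE_WORDS]

def ngrams(tokens, max_n=4):
    """Yield all N-grams for N = 1..max_n as tuples."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def emerging_topics(documents, max_n=4, top_k=5):
    """Rank N-grams by the number of documents citing them,
    boosting N-grams that contain a 'hot' word."""
    doc_freq = Counter()
    for doc in documents:
        # set() so each N-gram counts once per document.
        for g in set(ngrams(normalize(doc), max_n)):
            weight = HOT_WORD_BOOST if HOT_WORDS & set(g) else 1
            doc_freq[g] += weight
    return doc_freq.most_common(top_k)

docs = [
    "Player X was traded to the Boston Red Sox on Monday.",
    "The Boston Red Sox confirmed the trade of Player X.",
    "Fans react to Player X joining the Boston Red Sox.",
]
for gram, score in emerging_topics(docs, top_k=3):
    print(" ".join(gram), score)
```

Ranking by document frequency rather than raw occurrence count keeps a single article that repeats a phrase many times from dominating the list, which is why the per-document and cross-document counts are kept separate.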

This general recipe can also be altered to leverage other attributes of the documents and the text therein. For example, the document origin (say cnn/sports vs. cnn/politics) can be used to select domain-specific lexicons. As another example, the process can more or less heavily emphasize the words/expressions from the document title (or other areas of the text with particular mark-up).
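The title-emphasis variant might look like this (TITLE_WEIGHT is an arbitrary illustrative value, and documents are assumed to arrive as title/body pairs; the tokenizer is the same crude stand-in as before):

```python
from collections import Counter

TITLE_WEIGHT = 3  # hypothetical boost for terms that also appear in the title

def tokenize(text):
    # Minimal tokenizer; a full pipeline would normalize as described above.
    return [t for t in (w.strip(".,!?\"'").lower() for w in text.split()) if t]

def title_weighted_topics(documents, top_k=5):
    """documents: list of (title, body) pairs. A term cited by a document
    scores TITLE_WEIGHT when it also occurs in that document's title,
    else 1."""
    scores = Counter()
    for title, body in documents:
        title_terms = set(tokenize(title))
        for term in set(tokenize(body)):
            scores[term] += TITLE_WEIGHT if term in title_terms else 1
    return scores.most_common(top_k)

docs = [
    ("Red Sox trade Player X",
     "The Boston Red Sox acquired Player X on Monday."),
    ("Trade rumors confirmed",
     "Player X joins the Boston Red Sox roster."),
]
```

The same weighting trick extends naturally to any specially marked-up region of the text (headings, bold spans, lead paragraphs).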

不及他 2024-08-26 22:08:43

The main algorithms behind Google News have been published in the academic literature by Google researchers:
