Is there a MongoDB trending topics gem?
I have a set of documents in MongoDB, each with a "description" value about the size of a tweet. I need to generate a trending topics list from them. This is clearly a solved problem, but I can't find a definitive answer or gem that gets the job done without writing the code myself.
I am using Ruby and Mongoid in my app.
Is there a Ruby gem that will help with or handle this? Thanks.
Comments (2)
I know of no such gem, but here's an algorithm you could implement yourself:
Extract n-grams from the texts. Since the texts are small (tweet-sized, as you said), extract all n-grams with no limit on length.
"I eat icecream" => {(I), (eat), (icecream), (I eat), (eat icecream), (I eat icecream)}
Compute a TF-IDF weight vector over each text's n-grams:
{(I):0.1, (eat):0.01, (icecream):0.2, (I eat):0.12, (eat icecream):0.001, (I eat icecream):0.00012}
Use cosine similarity as the similarity measure for an incremental clustering algorithm over your vectors; one option is to script the Weka library from JRuby.
Order the clusters by population size. The n-grams at the centers of the largest clusters are your trending topics.
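The steps above can be sketched in plain Ruby without Weka. This is only a minimal illustration under several assumptions that are not part of the answer: the method names, the smoothed IDF formula, the single-pass "assign to the nearest cluster or start a new one" strategy, and the 0.3 similarity threshold are all hypothetical choices that would need tuning on real data.

```ruby
# All contiguous word n-grams of a text, from single words up to the whole text.
def ngrams(text)
  words = text.downcase.scan(/[\p{Alnum}#']+/)
  (1..words.size).flat_map do |n|
    words.each_cons(n).map { |gram| gram.join(" ") }
  end
end

# One TF-IDF vector (Hash of n-gram => weight) per document.
def tfidf_vectors(docs)
  grams_per_doc = docs.map { |doc| ngrams(doc) }
  doc_freq = Hash.new(0)
  grams_per_doc.each { |grams| grams.uniq.each { |g| doc_freq[g] += 1 } }

  grams_per_doc.map do |grams|
    grams.tally.to_h do |gram, count|
      tf  = count.to_f / grams.size
      # Smoothed IDF (an assumption; other variants work too).
      idf = Math.log((1.0 + docs.size) / (1.0 + doc_freq[gram])) + 1.0
      [gram, tf * idf]
    end
  end
end

# Cosine similarity between two sparse vectors stored as Hashes.
def cosine(a, b)
  dot   = a.sum { |gram, w| w * b.fetch(gram, 0.0) }
  norm  = ->(v) { Math.sqrt(v.values.sum { |w| w * w }) }
  denom = norm.(a) * norm.(b)
  denom.zero? ? 0.0 : dot / denom
end

Cluster = Struct.new(:centroid, :size)

# Single-pass incremental clustering: each vector joins the most similar
# existing cluster if it is similar enough, otherwise it starts a new one.
def cluster(vectors, threshold: 0.3)
  clusters = []
  vectors.each do |vec|
    best = clusters.max_by { |c| cosine(c.centroid, vec) }
    if best && cosine(best.centroid, vec) >= threshold
      # Update the centroid as a running mean of its members.
      keys = best.centroid.keys | vec.keys
      best.centroid = keys.to_h do |k|
        [k, (best.centroid.fetch(k, 0.0) * best.size + vec.fetch(k, 0.0)) / (best.size + 1)]
      end
      best.size += 1
    else
      clusters << Cluster.new(vec.dup, 1)
    end
  end
  clusters
end

# Largest clusters first; each is summarised by its heaviest n-grams.
def trending(docs, top: 3, threshold: 0.3)
  cluster(tfidf_vectors(docs), threshold: threshold)
    .sort_by { |c| -c.size }
    .map { |c| c.centroid.max_by(top) { |_gram, w| w }.map(&:first) }
end
```

Calling `trending(descriptions)` with an array of description strings returns one list of n-grams per cluster, largest cluster first. Extracting every n-gram makes the vectors grow quickly, which is only reasonable because the texts are tweet-sized.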
A quick search of rubygems.org revealed that you are going to have to do some programming. This is a good thing, as a system that detects trends generically would be either hopelessly difficult to set up and tune or awful at guessing what constitutes a "trend" in your application.
I'm going to make some assumptions about your application.
Let's assume users categorize their own tweets with hashtags (#). Let's also say that a sorted count of these hashtags determines whether a topic is trending.
Now let's talk about the computer science part. Given our assumptions above, you will need to be able to quickly query and sort a collection of hashtags to figure out what is trending.
You are using MongoDB and Mongoid (with Rails), so the simplest way to do this would be to create a collection of tag documents, each containing a count of how often the tag has been used. Create indexes on tag and count.
When someone tweets, work out what the hashtags are, look them up in the tags collection and increment their counts. To find out what is trending, query the tags collection and sort by count. This gives you the all-time trending hashtags.
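The counting scheme described above can be sketched with a plain Ruby Hash standing in for the tags collection. The `TagCounter` class, its method names and the hashtag regex are hypothetical; in the real app each entry would be a tag document in MongoDB, the increment would be an atomic update, and the sort would be served by the index on count.

```ruby
# Hypothetical in-memory stand-in for the tags collection: tag => use count.
class TagCounter
  HASHTAG = /#(\w+)/

  def initialize
    @counts = Hash.new(0)
  end

  # Extract the hashtags from a description and increment each tag once.
  def record(description)
    description.scan(HASHTAG).flatten.map(&:downcase).uniq.each do |tag|
      @counts[tag] += 1
    end
  end

  # All-time top tags, highest count first; ties are broken alphabetically.
  def trending(limit = 10)
    @counts.sort_by { |tag, count| [-count, tag] }.first(limit)
  end
end

counter = TagCounter.new
counter.record("Loving #Ruby and #mongodb")
counter.record("#ruby is fun")
counter.trending(2) # => [["ruby", 2], ["mongodb", 1]]
```

Lowercasing the tags and counting each tag once per description are design choices of this sketch, not requirements.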
If you want to get more specific, then instead of storing a single count, store counts broken out by time delta (week, day, hour, etc.), perhaps keeping them in separate documents. You could create documents that represent each time delta instead of each individual tag, and store all the tags with their counts inside.
You could also use a capped collection. All of this really depends on what you are trying to do. You can get really fancy and calculate trends with time decay, for example. Reading the reddit or Hacker News code would give you a good idea of what that looks like. Hope that helps.
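As a rough illustration of counts broken out by time delta, the sketch below keeps one Hash of tag counts per hourly bucket and sums the buckets inside the requested window. The class name, the one-hour default and the `since:`/`now:` parameters are hypothetical; in MongoDB each bucket would be its own document holding the tags and their counts.

```ruby
# Hypothetical in-memory stand-in for "one document per time delta".
class BucketedTagCounter
  def initialize(bucket_seconds: 3600)
    @bucket_seconds = bucket_seconds
    # bucket number => { tag => count }
    @buckets = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  end

  # Count one use of a tag in the bucket that contains the given time.
  def record(tag, at: Time.now)
    @buckets[bucket_for(at)][tag] += 1
  end

  # Sum the counts of every bucket between `since` and `now`, highest first.
  def trending(since:, now: Time.now, limit: 10)
    window = bucket_for(since)..bucket_for(now)
    totals = Hash.new(0)
    @buckets.each do |bucket, counts|
      next unless window.cover?(bucket)
      counts.each { |tag, n| totals[tag] += n }
    end
    totals.sort_by { |tag, n| [-n, tag] }.first(limit)
  end

  private

  # Whole buckets are included, so the window is rounded to bucket boundaries.
  def bucket_for(time)
    time.to_i / @bucket_seconds
  end
end
```

A query such as `counter.trending(since: Time.now - 86_400)` then reads at most 25 hourly buckets instead of scanning every tweet from the last day.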
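One simple way to get time decay is to weight every use of a tag by an exponential factor that halves after a fixed period. This is a generic sketch, not the formula used by reddit or Hacker News; the method names, the event Hash layout and the six-hour half-life are all assumptions.

```ruby
# Score a tag from the times it was used: each use is worth 1.0 when new
# and loses half of its remaining value every `half_life` seconds.
def decayed_score(timestamps, now:, half_life: 6 * 3600)
  timestamps.sum(0.0) { |t| 0.5**((now - t) / half_life.to_f) }
end

# Rank tags by decayed score, highest first.
# `events` is an array of hypothetical { tag: String, at: Time } hashes.
def rank_by_decay(events, now:, half_life: 6 * 3600)
  events.group_by { |event| event[:tag] }
        .map { |tag, uses| [tag, decayed_score(uses.map { |u| u[:at] }, now: now, half_life: half_life)] }
        .sort_by { |tag, score| [-score, tag] }
end
```

With this weighting a tag used once a few minutes ago can outrank a tag used several times yesterday, which a plain count cannot express.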