Finding the most frequent items without counting every item

Posted on 2024-08-31 04:53:01

Is there an algorithm for finding the "most frequent items" without having to keep a count of each item? For example, suppose I run a search engine and want to keep track of the 10 most popular searches. I don't want to keep a counter for every query, since there could be too many queries to count (and most of them will be singletons). Is there a simple algorithm for this? Maybe something probabilistic? Thanks!


Comments (4)

婴鹅 2024-09-07 04:53:01

Well, if you have a very large number of queries (as a search engine presumably would), you could just "sample" the queries. You might be getting 1,000 queries per second, but if you count only one query per second, then over a longish period of time you'd get an answer relatively close to the "real" one.

This is how a "sampling" profiler works, for example. Every n milliseconds it looks at which function is currently executing. Over a long period of time (several seconds) you get a good idea of the "expensive" functions, because they're the ones that appear in your samples most often.

You still have to do the "counting", but by sampling periodically instead of counting every single query, you get an upper bound on the amount of data you actually have to store (e.g. at most one query per second).
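A minimal sketch of the sampling idea, under a few assumptions: it samples each query with a fixed probability rather than once per second (a simple stand-in for time-based sampling), and the names `sampled_top_k` and `sample_rate` are made up for this illustration.

```python
import random
from collections import Counter

def sampled_top_k(queries, sample_rate=0.001, k=10, seed=None):
    """Count only a random fraction of the queries; return the top k of the sample."""
    rng = random.Random(seed)
    counts = Counter()
    for q in queries:
        # Keep roughly one query in every 1/sample_rate; the rest are never stored.
        if rng.random() < sample_rate:
            counts[q] += 1
    return counts.most_common(k)
```

The counts are counts of the sample, not of the full stream, so only the ranking is meaningful. Rare queries mostly never enter the table at all, which is where the memory saving comes from.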

这个俗人 2024-09-07 04:53:01

If you want the most frequent searches at any given time, you don't need endless counters keeping track of every query ever submitted. Instead, you need an algorithm that measures the number of submissions of a given query per fixed period of time, which is pretty simple. Any search submitted to your search engine, for example the word "cache", is stored for a fixed period of time called the refresh rate. (The length of the refresh rate depends on the kind of traffic your search engine gets and the number of "top results" you want to track.) If the refresh period expires and searches for "cache" have not persisted, the query is deleted from memory. If searches for "cache" do persist, your algorithm only needs to keep track of the rate at which "cache" is being searched. To do this, simply store all searches in a "leaky counter": every entry is pushed onto the counter with an expiration date, after which it is deleted. The counters that are still active indicate your top queries.
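One way to read the "leaky counter" is as a sliding-window counter, sketched below. This is an interpretation, not necessarily what the answer's author had in mind: the class name `ExpiringCounter`, the `window_seconds` parameter, and passing the current time in explicitly as `now` are all assumptions made for the example.

```python
from collections import Counter, deque

class ExpiringCounter:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()    # (timestamp, query), oldest first
        self.counts = Counter()  # live counts inside the window

    def _expire(self, now):
        # Drop every entry whose expiration time has passed.
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]  # the query did not persist: forget it

    def record(self, query, now):
        self._expire(now)
        self.events.append((now, query))
        self.counts[query] += 1

    def top(self, k, now):
        self._expire(now)
        return self.counts.most_common(k)
```

Memory here is bounded by the traffic inside one window rather than by the whole query history, so a shorter window trades accuracy on slow-moving trends for a smaller table.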

夜雨飘雪 2024-09-07 04:53:01

Storing each and every query would be expensive, yet it is necessary if you want to be sure the top 10 are actually the top 10. You'll have to cheat.

One idea is to store a table of URLs, hit counters, and timestamps, indexed by count and then by timestamp. When the table reaches some arbitrary near-maximum size, start removing low-end entries that are older than a given number of days. Old, infrequent queries won't be counted, but the queries likely to make the top 10 should stay in the table because they arrive at a faster rate.
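A rough sketch of such a pruned table, with several assumptions spelled out: the timestamp is taken to mean "last seen", pruning stops once the table is half full (an arbitrary target), and `PrunedTable`, `max_size`, and `max_age` are names invented for the example. Note that if nothing is stale when the table fills up, this sketch simply lets the table grow.

```python
class PrunedTable:
    def __init__(self, max_size, max_age):
        self.max_size = max_size
        self.max_age = max_age
        self.rows = {}  # query -> [hit_count, last_seen]

    def record(self, query, now):
        row = self.rows.get(query)
        if row is None:
            if len(self.rows) >= self.max_size:
                self._prune(now)
            self.rows[query] = [1, now]
        else:
            row[0] += 1
            row[1] = now

    def _prune(self, now):
        # Stale rows, ordered from the low end: smallest count, then oldest.
        stale = sorted(
            (q for q, (_, seen) in self.rows.items() if now - seen > self.max_age),
            key=lambda q: tuple(self.rows[q]),
        )
        for q in stale:
            if len(self.rows) <= self.max_size // 2:
                break  # assumed target: stop once the table is half full
            del self.rows[q]

    def top(self, k):
        ranked = sorted(self.rows.items(), key=lambda kv: kv[1][0], reverse=True)
        return [(q, row[0]) for q, row in ranked[:k]]
```

A real table would keep an index ordered by (count, timestamp), as the answer describes, instead of sorting on every prune; the sort keeps the sketch short.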

Another idea is to write a 16-bit (or wider) hash function for search queries and keep a 65536-entry table holding counters and URLs. When a search is performed, increment the corresponding table entry and set its URL if necessary. This approach has a major drawback, however: distinct queries that hash to the same entry share one counter. A spam bot could make repeated queries like "cheap viagra", and legitimate queries that collide with it would increment the spam query's counter too, possibly placing its message on your main page.
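The hashed table could look roughly like this. The choice of hash is an assumption (the first two bytes of a SHA-256 digest stand in for "a 16-bit hash function"), the stored label is the query text rather than a URL, and `slot_of` and `HashedCounters` are hypothetical names.

```python
import hashlib

TABLE_BITS = 16
TABLE_SIZE = 1 << TABLE_BITS  # 65536 slots

def slot_of(query):
    """Map a query to a 16-bit slot using the first two bytes of its SHA-256 digest."""
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big")

class HashedCounters:
    def __init__(self):
        self.counts = [0] * TABLE_SIZE
        self.labels = [None] * TABLE_SIZE  # first query seen in each slot

    def record(self, query):
        i = slot_of(query)
        self.counts[i] += 1  # colliding queries share this counter
        if self.labels[i] is None:
            self.labels[i] = query

    def top(self, k):
        best = sorted(range(TABLE_SIZE), key=lambda i: self.counts[i], reverse=True)[:k]
        return [(self.labels[i], self.counts[i]) for i in best if self.counts[i] > 0]
```

Memory is fixed at 65536 slots no matter how many distinct queries arrive. The price is the collision problem described above: a slot's count is the sum over every query that hashes to it, while its label names only one of them.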

转角预定愛 2024-09-07 04:53:01

You want a cache, of which there are many kinds; see the Wikipedia articles "Cache algorithms" and "Page replacement algorithm" (the "Aging" section).
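As a hedged illustration of how aging might carry over from pages to queries: each entry keeps a small counter that is shifted right on every tick, with the top bit set if the query was seen during that tick. The fixed-capacity table, the eviction rule, and the names `AgingTable`, `record`, and `tick` are assumptions for this sketch, not something the answer specifies.

```python
COUNTER_BITS = 8

class AgingTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.age = {}         # query -> aging counter, COUNTER_BITS wide
        self.touched = set()  # queries seen since the last tick

    def record(self, query):
        if query not in self.age:
            if len(self.age) >= self.capacity:
                # Evict an untouched entry if possible, lowest counter first.
                victim = min(self.age, key=lambda q: (q in self.touched, self.age[q]))
                del self.age[victim]
                self.touched.discard(victim)
            self.age[query] = 0
        self.touched.add(query)

    def tick(self):
        top_bit = 1 << (COUNTER_BITS - 1)
        for q in self.age:
            self.age[q] >>= 1  # older activity counts for half as much
            if q in self.touched:
                self.age[q] |= top_bit
        self.touched.clear()

    def top(self, k):
        return sorted(self.age, key=self.age.get, reverse=True)[:k]
```

Aging weights recent activity more heavily than old activity, so this ranks queries by how recently and how regularly they appear rather than by an exact total count.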
