Reducing the granularity of a data set



I have an in-memory cache which stores a set of information by a certain level of aggregation - in the Students example below let's say I store it by Year, Subject, Teacher:

#    Students    Year    Subject    Teacher
1    30          7       Math       Mrs Smith
2    28          7       Math       Mr Cork
3    20          8       Math       Mrs Smith
4    20          8       English    Mr White
5    18          8       English    Mr Book
6    10          12      Math       Mrs Jones

Now unfortunately my cache doesn't have GROUP BY or similar functions - so when I want to look at things at a higher level of aggregation, I will have to 'roll up' the data myself. For example, if I aggregate Students by Year, Subject the aforementioned data would look like so:

#    Students    Year    Subject
1    58          7       Math
2    20          8       Math 
3    38          8       English
4    10          12      Math

My question is thus - how would I best do this in Java? Theoretically I could be pulling back tens of thousands of objects from this cache, so being able to 'roll up' these collections quickly may become very important.

My initial (perhaps naive) thought would be to do something along the following lines;

Until I exhaust the list of records:

  • Each 'unique' record that I come across is added as a key to a hashmap.
  • If I encounter a record that has the same data for this new level of aggregation, add its quantity to the existing one (sketched in code below).
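In code, that idea might look like the following minimal sketch. All class and field names here are illustrative (the real cache API isn't shown), and it assumes Java 16+ for record types:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class RollUpSketch {

        // Hypothetical row type mirroring the cached data in the table above.
        record StudentRow(int students, int year, String subject, String teacher) {}

        // Composite key for the coarser aggregation level (Year, Subject).
        // Records get equals()/hashCode() for free, so they work as map keys.
        record YearSubject(int year, String subject) {}

        // Single pass over the rows: one HashMap lookup per record, O(n) overall.
        static Map<YearSubject, Integer> rollUp(List<StudentRow> rows) {
            Map<YearSubject, Integer> totals = new HashMap<>();
            for (StudentRow r : rows) {
                // merge() inserts the count for an unseen key, or adds it
                // to the running total for a key we've already met.
                totals.merge(new YearSubject(r.year(), r.subject()),
                             r.students(), Integer::sum);
            }
            return totals;
        }

        // The same roll-up via the Streams API (Java 8+ collectors).
        static Map<YearSubject, Integer> rollUpWithStreams(List<StudentRow> rows) {
            return rows.stream().collect(Collectors.groupingBy(
                    r -> new YearSubject(r.year(), r.subject()),
                    Collectors.summingInt(StudentRow::students)));
        }

        public static void main(String[] args) {
            List<StudentRow> rows = List.of(
                    new StudentRow(30, 7, "Math", "Mrs Smith"),
                    new StudentRow(28, 7, "Math", "Mr Cork"),
                    new StudentRow(20, 8, "Math", "Mrs Smith"),
                    new StudentRow(20, 8, "English", "Mr White"),
                    new StudentRow(18, 8, "English", "Mr Book"),
                    new StudentRow(10, 12, "Math", "Mrs Jones"));
            // Prints e.g. "7 Math -> 58", matching the rolled-up table above.
            rollUp(rows).forEach((k, v) ->
                    System.out.println(k.year() + " " + k.subject() + " -> " + v));
        }
    }

Either variant makes a single pass with one hash lookup per record, so tens of thousands of rows should not be a problem in practice.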

Now for all I know this is a fairly common problem and there's much better ways of doing this. So I'd welcome any feedback as to whether I'm pointing myself in the right direction.

"Get a new cache" not an option I'm afraid :)

-Dave.


Comments (1)

上课铃就是安魂曲 2024-11-21 20:15:22


Your "initial thought" isn't a bad approach. The only way to improve on it would be to have an index for the fields on which you are aggregating (year and subject). (That's basically what a dbms does when you define an index.) Then your algorithm could be recast as iterating through all index values; you wouldn't have to check the results hash for each record.

Of course, you would have to build the index when populating the cache and maintain it as data is modified.
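A rough sketch of what maintaining such an index might look like, with the cache itself stubbed out and the row/key types purely illustrative (the same hypothetical StudentRow and YearSubject shapes as in the question):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class IndexedCacheSketch {

        // Hypothetical types mirroring the question's data.
        record StudentRow(int students, int year, String subject, String teacher) {}
        record YearSubject(int year, String subject) {}

        // The index: aggregation key -> rows that fall into that bucket.
        // Built while the cache is populated, maintained on every modification.
        private final Map<YearSubject, List<StudentRow>> index = new HashMap<>();

        public void add(StudentRow row) {
            index.computeIfAbsent(keyOf(row), k -> new ArrayList<>()).add(row);
        }

        public void remove(StudentRow row) {
            List<StudentRow> bucket = index.get(keyOf(row));
            if (bucket != null) {
                bucket.remove(row);
            }
        }

        // The roll-up now just walks the index buckets: no per-record probe
        // of a results hash, because the grouping work was done up front.
        public Map<YearSubject, Integer> rollUp() {
            Map<YearSubject, Integer> totals = new HashMap<>();
            index.forEach((key, bucket) -> totals.put(key,
                    bucket.stream().mapToInt(StudentRow::students).sum()));
            return totals;
        }

        private static YearSubject keyOf(StudentRow row) {
            return new YearSubject(row.year(), row.subject());
        }
    }

If the per-bucket rows aren't needed for anything else, the index could instead hold running integer totals, which would make rollUp() a plain copy of the map.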
