Reducing the granularity of a data set
I have an in-memory cache which stores a set of information by a certain level of aggregation - in the Students example below let's say I store it by Year, Subject, Teacher:
#  Students  Year  Subject  Teacher
1  30        7     Math     Mrs Smith
2  28        7     Math     Mr Cork
3  20        8     Math     Mrs Smith
4  20        8     English  Mr White
5  18        8     English  Mr Book
6  10        12    Math     Mrs Jones
Now unfortunately my cache doesn't have GROUP BY or similar functions - so when I want to look at things at a higher level of aggregation, I will have to 'roll up' the data myself. For example, if I aggregate Students by Year, Subject the aforementioned data would look like so:
#  Students  Year  Subject
1  58        7     Math
2  20        8     Math
3  38        8     English
4  10        12    Math
My question is thus - how would I best do this in Java? Theoretically I could be pulling back tens of thousands of objects from this cache, so being able to 'roll up' these collections quickly may become very important.
My initial (perhaps naive) thought would be to do something along the following lines:
Until I exhaust the list of records:
- Each 'unique' record that I come across is added as a key to a hashmap.
- If I encounter a record that has the same data for this new level of aggregation, add its quantity to the existing one.
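The two steps above can be sketched in Java roughly as follows. This is only a minimal sketch: `StudentCount`, `RollUp` and `sampleData` are hypothetical names made up for illustration (the real cache objects will look different), and the aggregation key is assumed to be a simple two-element list of year and subject.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type standing in for whatever the cache returns.
final class StudentCount {
    final int students;
    final int year;
    final String subject;
    final String teacher;

    StudentCount(int students, int year, String subject, String teacher) {
        this.students = students;
        this.year = year;
        this.subject = subject;
        this.teacher = teacher;
    }
}

final class RollUp {
    // Sums student counts per (year, subject) in a single pass over the records.
    // The key is a two-element list, which has value-based equals/hashCode.
    static Map<List<Object>, Integer> byYearAndSubject(List<StudentCount> records) {
        Map<List<Object>, Integer> totals = new LinkedHashMap<>();
        for (StudentCount r : records) {
            List<Object> key = Arrays.<Object>asList(r.year, r.subject);
            // First sighting of a key stores the count; later ones add to it.
            totals.merge(key, r.students, Integer::sum);
        }
        return totals;
    }

    // The six rows from the question, used only as sample input.
    static List<StudentCount> sampleData() {
        List<StudentCount> rows = new ArrayList<>();
        rows.add(new StudentCount(30, 7, "Math", "Mrs Smith"));
        rows.add(new StudentCount(28, 7, "Math", "Mr Cork"));
        rows.add(new StudentCount(20, 8, "Math", "Mrs Smith"));
        rows.add(new StudentCount(20, 8, "English", "Mr White"));
        rows.add(new StudentCount(18, 8, "English", "Mr Book"));
        rows.add(new StudentCount(10, 12, "Math", "Mrs Jones"));
        return rows;
    }
}
```

Calling `RollUp.byYearAndSubject(RollUp.sampleData())` should give four entries, matching the rolled-up table above. Each record costs one hash lookup, so the pass is linear in the number of records, which ought to be fine for tens of thousands of objects.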
Now for all I know this is a fairly common problem and there are much better ways of doing this. So I'd welcome any feedback as to whether I'm pointing myself in the right direction.
"Get a new cache" is not an option, I'm afraid :)
-Dave.
Comments (1)
Your "initial thought" isn't a bad approach. The only way to improve on it would be to have an index for the fields on which you are aggregating (year and subject). (That's basically what a DBMS does when you define an index.) Then your algorithm could be recast as iterating through all index values; you wouldn't have to check the results hash for each record.
Of course, you would have to build the index when populating the cache and maintain it as data is modified.
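One way the index idea might look in Java is sketched below. Everything here is an assumption for illustration: `IndexedStudentCache` is a hypothetical wrapper you would write around the real cache, `StudentCount` is a made-up record type, and `add`/`remove` stand for wherever your code populates or evicts entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type standing in for whatever the cache stores.
final class StudentCount {
    final int students;
    final int year;
    final String subject;
    final String teacher;

    StudentCount(int students, int year, String subject, String teacher) {
        this.students = students;
        this.year = year;
        this.subject = subject;
        this.teacher = teacher;
    }
}

// Hypothetical wrapper that keeps a (year, subject) index next to the cached records.
final class IndexedStudentCache {
    private final Map<List<Object>, List<StudentCount>> byYearAndSubject = new LinkedHashMap<>();

    private static List<Object> keyOf(StudentCount r) {
        return Arrays.<Object>asList(r.year, r.subject);
    }

    // Call this wherever the real cache is populated.
    void add(StudentCount r) {
        byYearAndSubject.computeIfAbsent(keyOf(r), k -> new ArrayList<>()).add(r);
    }

    // Call this wherever a record is evicted or replaced; matches by object identity.
    void remove(StudentCount r) {
        List<Object> key = keyOf(r);
        List<StudentCount> bucket = byYearAndSubject.get(key);
        if (bucket != null && bucket.remove(r) && bucket.isEmpty()) {
            byYearAndSubject.remove(key); // drop empty buckets so they do not appear in roll-ups
        }
    }

    // Walks the index: one bucket per output row, no per-record lookup in a results map.
    Map<List<Object>, Integer> rollUp() {
        Map<List<Object>, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<List<Object>, List<StudentCount>> e : byYearAndSubject.entrySet()) {
            int sum = 0;
            for (StudentCount r : e.getValue()) {
                sum += r.students;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }
}
```

The trade-off is the one described above: the grouping work moves to write time, so every insert and removal has to go through the wrapper or the index drifts out of sync with the cache.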