MapReduce/Aggregation Operations in Spring Batch
Is it possible to do MapReduce-style operations in Spring Batch?
I have two steps in my batch job. The first step calculates an average. The second step compares each value with the average to determine another value.
For example, let's say I have a huge database of student scores. The first step calculates the average score for each course/exam. The second step compares individual scores with the average to determine a grade based on some simple rules:
- A if the student scores above average
- B if the student scores exactly the average
- C if the student scores below average
Currently my first step is SQL which selects the average and writes it to a table. The second step is SQL which joins the average scores with the individual scores and uses a Processor to implement the grading rule.
Similar aggregation functions such as avg and min are used a lot in these Steps, and I'd really prefer that this be done in Processors, keeping the SQL as simple as possible. Is there any way to write a Processor which aggregates results across multiple rows based on a grouping criterion and then writes the average/min to the output table once?
This pattern repeats a lot, and I'm not looking for a single-processor implementation that uses SQL to fetch both the average and the individual scores.
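For reference, the grading rule in the second step maps onto a plain `ItemProcessor`. The sketch below is only an illustration: the `StudentScore`/`GradedScore` types and their fields are hypothetical, and it assumes the step's SQL has already joined the course average onto each score row.

```java
import org.springframework.batch.item.ItemProcessor;

// Hypothetical domain types; the real ones would be mapped by the step's reader.
record StudentScore(long studentId, long courseId, double score, double courseAverage) {}
record GradedScore(long studentId, long courseId, String grade) {}

// Applies the A/B/C rule to one row at a time; assumes the course average
// has already been joined onto the row by the step's SQL.
public class GradingProcessor implements ItemProcessor<StudentScore, GradedScore> {

    @Override
    public GradedScore process(StudentScore item) {
        final String grade;
        if (item.score() > item.courseAverage()) {
            grade = "A";            // above average
        } else if (item.score() < item.courseAverage()) {
            grade = "C";            // below average
        } else {
            grade = "B";            // exactly average
        }
        return new GradedScore(item.studentId(), item.courseId(), grade);
    }
}
```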
Comments (2)
It is possible. You do not even need more than one step; Map-Reduce can be implemented in a single step. You can create a step with an ItemReader and an ItemWriter associated with it, and think of the ItemReader-ItemWriter pair as Map-Reduce. You can achieve the necessary effect by using a custom reader and writer with proper aggregation across rows. It might also be a good idea for your reader/writer to implement the ItemStream interface so that Spring Batch saves their intermediate state to the step's ExecutionContext.
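A minimal sketch of the "reduce" side of such a step, assuming Spring Batch 5's `ItemWriter` signature (older versions take a `List` instead of a `Chunk`): the writer accumulates a running sum/count per course across chunks and inserts one average row per course when the step finishes. The `ScoreRow` type, the `course_average` table, and the SQL are all hypothetical; a production version would also persist the running totals in the ExecutionContext (via ItemStream), as the answer suggests, so that the step stays restartable.

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.item.Chunk;
import org.springframework.batch.item.ItemWriter;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical input type: one row per individual score, produced by the step's reader.
record ScoreRow(long courseId, double score) {}

// The "reduce" side of the step: accumulates per-course running totals across
// chunks and writes one average row per course when the step completes.
public class AveragingItemWriter implements ItemWriter<ScoreRow>, StepExecutionListener {

    private final JdbcTemplate jdbcTemplate;
    private final Map<Long, double[]> sumAndCountByCourse = new HashMap<>();

    public AveragingItemWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void write(Chunk<? extends ScoreRow> chunk) {
        for (ScoreRow row : chunk) {
            double[] acc = sumAndCountByCourse.computeIfAbsent(row.courseId(), id -> new double[2]);
            acc[0] += row.score();   // running sum
            acc[1]++;                // running count
        }
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // Hypothetical output table: one average per course.
        sumAndCountByCourse.forEach((courseId, acc) -> jdbcTemplate.update(
                "INSERT INTO course_average (course_id, avg_score) VALUES (?, ?)",
                courseId, acc[0] / acc[1]));
        return ExitStatus.COMPLETED;
    }
}
```

If the step builder does not pick up the listener callback automatically, the writer would also need to be registered on the step (e.g. via `.listener(writer)`).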
I tried it just for fun, but I think it is pointless, since your processing capacity is limited to a single JVM; in other words, you cannot reach the production performance of a Hadoop cluster (or of another real map-reduce implementation). It will also be really hard to scale as your data size grows.
Nice observation, but IMO currently useless for real-world tasks.
I feel that a batch processing framework should separate programming/configuration and run-time concerns. It would be nice if Spring Batch provided a generic solution across all major batch processing runtimes, such as a single JVM, a Hadoop cluster (which also uses the JVM), etc.:
-> Write batch programs using the Spring Batch programming/configuration model, which integrates other programming models like map-reduce, traditional Java, etc.
-> Select the runtime based on your needs (a single JVM, a Hadoop cluster, or NoSQL).
Spring Data attempts to solve part of this by providing a unified configuration model and API usage for various types of data sources.