Big data processing at the click of a button
If you have an application that performs some heavy calculation on a large data set, and the results must be returned as quickly as possible at the click of a button, what are some architectural designs that are used to make this work large scale?
For example, an application runs a simulation to predict future results, and then does some statistical analysis on that data along with historical data. Running the simulation is CPU-heavy and write-heavy on the database; then there are lots of big DB reads to collect historical data, and more CPU work for the statistical analysis.
In essence, there is lots of data to process (both CPU and IO intensive), and the results should theoretically be shown at the click of a button.
I understand that this is not always a realistic goal depending on the intensity, but what are some typical architectures to accomplish such a task?
Google does this to return search results.
Check out Hadoop - http://hadoop.apache.org/ - and specifically, MapReduce.
"Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes."
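To make the pattern concrete, here is a minimal single-process sketch of the map → shuffle → reduce flow (a toy word count; the chunks, function names, and data are illustrative, not from Hadoop's API). In a real Hadoop job, each map and reduce call would run in parallel across cluster nodes:

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of input."""
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    """Group intermediate pairs by key (the framework does this for you)."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Each chunk would be a separate split processed on a different node.
chunks = ["big data big compute", "big data small results"]
counts = reduce_phase(shuffle(map(map_phase, chunks)))
print(counts["big"])  # → 3
```

The point is that map and reduce have no shared state, which is what lets the framework fan the work out across a large cluster.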
Rob has suggested a nice approach using MapReduce.
I believe this sort of processing is part of a data-mining process, which takes a different approach than the traditional request-response model.
As a bare minimum, create a single denormalized table and store all the necessary information in it; then, when users need the information in real time, just do a table lookup and return it as quickly as possible.
There are challenges to this approach, though, and one of the major ones is populating the denormalized table. Most of the time this can be done offline, e.g. by a nightly job or some other process that populates the table when load is at a minimum.
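A minimal sketch of that split, using SQLite for illustration (table and column names are made up for the example): the heavy computation happens in an offline batch job that fills the denormalized table, and the button click is reduced to a single indexed read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE daily_summary (
    customer_id INTEGER PRIMARY KEY,
    predicted_total REAL,
    computed_at TEXT)""")

def nightly_job(rows):
    """Offline batch: run the heavy computation elsewhere, then store results."""
    conn.executemany(
        "INSERT OR REPLACE INTO daily_summary VALUES (?, ?, ?)", rows)
    conn.commit()

def on_button_click(customer_id):
    """Real-time path: a single primary-key lookup, no heavy work."""
    row = conn.execute(
        "SELECT predicted_total FROM daily_summary WHERE customer_id = ?",
        (customer_id,)).fetchone()
    return row[0] if row else None

nightly_job([(1, 1234.5, "2024-01-01"), (2, 99.0, "2024-01-01")])
print(on_button_click(1))  # → 1234.5
```

The trade-off is freshness: the user sees results as of the last batch run, not of this instant, which is usually acceptable for this kind of analysis.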
This approach is one of the techniques behind the "Customers who bought this item also bought" feature you see in a typical e-commerce application.
For more information and references, please see:
1- SQL Server Analysis Services
2- Item-to-Item Collaborative Filtering (esp. refer to Amazon's implementation)
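As a rough illustration of the item-to-item idea (this is just co-purchase counting on toy data; Amazon's published algorithm computes proper similarity scores, and the item names here are invented):

```python
from collections import defaultdict
from itertools import combinations

# Toy order history: each basket is the set of items in one purchase.
orders = [
    {"book", "lamp"},
    {"book", "lamp", "desk"},
    {"book", "desk"},
]

# Precompute, offline, how often each pair of items is bought together.
co_counts = defaultdict(lambda: defaultdict(int))
for basket in orders:
    for a, b in combinations(sorted(basket), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def also_bought(item, k=2):
    """Top-k items most often purchased together with `item`."""
    ranked = sorted(co_counts[item].items(), key=lambda kv: -kv[1])
    return [other for other, _ in ranked[:k]]

print(also_bought("book"))
```

As with the denormalized table, the expensive pair-counting runs as a batch job over the order history; serving a recommendation is then just a lookup.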