Books on big data processing techniques and applications
I am looking for good resources on how to query large volumes of data efficiently.
Each data item has many attributes, such as quantity, price, and history information. Clients will supply varying query criteria, but the dataset itself does not need to change. Simply storing all of the data in MS SQL is not a good approach, because MS SQL does not scale that well. We are targeting many terabytes of data and will need a cluster of 200-300 CPUs.
I am interested in good resources or books that I can use to at least start researching this.
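To make the workload concrete, here is a minimal sketch, assuming the read-only dataset is split into partitions that are each scanned independently and filtered by a client-supplied predicate. The names `Item`, `scan_partition`, and `query`, as well as the sample records, are hypothetical, and a local thread pool only stands in for a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)  # frozen: the dataset is never modified by queries
class Item:
    item_id: int
    quantity: int
    price: float


def scan_partition(partition: List[Item],
                   predicate: Callable[[Item], bool]) -> List[Item]:
    """Return the items of one partition that satisfy the client's predicate."""
    return [item for item in partition if predicate(item)]


def query(partitions: List[List[Item]],
          predicate: Callable[[Item], bool],
          workers: int = 4) -> List[Item]:
    """Scan every partition independently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda p: scan_partition(p, predicate), partitions)
        return [item for partial in partials for item in partial]


# Hypothetical sample data: two partitions of a larger read-only dataset.
partitions = [
    [Item(1, 10, 2.5), Item(2, 500, 1.0)],
    [Item(3, 120, 9.9), Item(4, 7, 30.0)],
]

# One possible client query: large quantity and low price.
matches = query(partitions, lambda it: it.quantity >= 100 and it.price < 5.0)
print([m.item_id for m in matches])
```

Because the data never changes, partitions need no coordination with each other, which is what would let this kind of scan spread across many machines.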
Comments (2)
Have you considered a NoSQL solution such as MongoDB?
If query speed is not your number one concern, you should see whether you could build a solution with ROOT, possibly in conjunction with PROOF. In contrast to a NoSQL solution, you would here trade consistency for some speed.
It is used by the CERN experiments to store and retrieve their experimental data (much more than you require), and if you can find a way to handle the I/O, it can be made to scale pretty well.
I have heard that it is also used by some firms doing quantitative finance.