Real-time querying/aggregating millions of records - hadoop? database? cassandra?
I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
- 1000s of datasets
- dataset keys:
- all datasets have the same keys
- 1 million keys (this may later be 10 or 20 million)
- dataset columns:
- each dataset has the same columns
- 10 to 20 columns
- most columns are numerical values that we need to aggregate (avg, stddev, and statistics calculated with R)
- a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
- web application
- user can choose which datasets they are interested in (anywhere from 15 to 1000)
- application needs to present: key, and aggregated results (avg, stddev) of each column
- updates of data:
- an entire dataset can be added, dropped, or replaced/updated
- would be cool to be able to add columns. But, if required, can just replace the entire dataset.
- never add rows/keys to a dataset - so don't need a system with lots of fast writes
- infrastructure:
- currently two machines with 24 cores each
- eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
- partitions are nice, since can easily add/drop/replace partitions
- database is nice for filtering based on type_id
- databases aren't easy for writing parallel queries (see the fan-out sketch after this list for what I mean)
- databases are good for structured data, and my data is not structured
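By "parallel queries" I mean something like the hand-rolled fan-out sketched below: one process per selected dataset partition, with the per-key combine done in the application. All table/column names here are placeholders, not my real schema.

```python
# Placeholder sketch (not the real schema): query each selected dataset
# partition in its own process, then combine values per key in the application.
from collections import defaultdict
from math import sqrt
from multiprocessing import Pool
import psycopg2

SELECTED = ["dataset_%04d" % i for i in range(1, 16)]   # e.g. 15 chosen datasets
TYPE_IDS = (1, 3)                                       # type_ids to include

def read_partition(table):
    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()
    # one value of val1 per key in this dataset, filtered by type_id
    cur.execute("SELECT key, val1 FROM " + table + " WHERE type_id IN %s",
                (TYPE_IDS,))
    rows = cur.fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    with Pool(24) as pool:                              # one worker per core
        per_dataset = pool.map(read_partition, SELECTED)

    values = defaultdict(list)                          # key -> values across datasets
    for rows in per_dataset:
        for key, val in rows:
            values[key].append(float(val))

    stats = {}                                          # key -> (avg, stddev)
    for key, vals in values.items():
        mean = sum(vals) / len(vals)
        stats[key] = (mean, sqrt(sum(x * x for x in vals) / len(vals) - mean * mean))
```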
As a proof of concept I tried out hadoop (a rough sketch of the map/reduce follows this list):
- created a tab separated file per dataset for a particular type_id
- uploaded to hdfs
- map: retrieved a value/column for each key
- reduce: computed average and standard deviation
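The map and reduce steps themselves are trivial; with Hadoop Streaming they look roughly like the two scripts below (column positions and field layout are placeholders, not my actual files):

```python
# mapper.py - hypothetical Hadoop Streaming mapper: emit "key<TAB>value"
# for one numeric column; the column position (3) is a placeholder.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    key, value = fields[0], fields[3]
    print("%s\t%s" % (key, value))
```

```python
# reducer.py - hypothetical Hadoop Streaming reducer: average and population
# stddev per key, relying on the shuffle phase to group lines by key.
import sys
from math import sqrt

def emit(key, n, s, s2):
    mean = s / n
    print("%s\t%f\t%f" % (key, mean, sqrt(s2 / n - mean * mean)))

current, n, s, s2 = None, 0, 0.0, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            emit(current, n, s, s2)
        current, n, s, s2 = key, 0, 0.0, 0.0
    v = float(value)
    n += 1
    s += v
    s2 += v * v
if current is not None:
    emit(current, n, s, s2)
```

Both scripts run over the tab-separated files in HDFS via the hadoop-streaming jar.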
From my crude proof of concept, I can see this will scale nicely, but hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Comments (5)
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
Check out http://en.wikipedia.org/wiki/Standard_deviation:

stddev(X) = sqrt(E[X^2] - (E[X])^2)

This implies that you can get the stddev of the combined set AB by computing sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] is (sum(A^2) + sum(B^2)) / (|A| + |B|) and, likewise, E[AB] is (sum(A) + sum(B)) / (|A| + |B|).
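In other words, if you pre-compute and store (count, sum, sum of squares) per key for every dataset, you can merge any user-selected subset of datasets at query time without touching the raw values. A minimal sketch of that merge (names are illustrative only):

```python
# Merge pre-computed (count, sum, sum of squares) triples for a single key
# across the datasets the user selected, then derive avg and stddev.
from math import sqrt

def merge_stats(triples):
    """triples: list of (n, sum, sum_of_squares), one per selected dataset."""
    triples = list(triples)
    n = sum(t[0] for t in triples)
    s = sum(t[1] for t in triples)
    s2 = sum(t[2] for t in triples)
    mean = s / n
    stddev = sqrt(s2 / n - mean * mean)   # sqrt(E[X^2] - (E[X])^2)
    return mean, stddev

# dataset A held values [1, 2, 3], dataset B held [2, 3] for this key:
print(merge_stats([(3, 6.0, 14.0), (2, 5.0, 13.0)]))   # -> (2.2, ~0.748)
```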
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open-source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open-source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory - that should give you an idea of what size of cluster you need.
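As a rough yardstick using the numbers from the question: 1,000 datasets × 1 million keys × ~15 numeric columns × 8 bytes is on the order of 120 GB of raw values before any overhead, and about 2.4 TB if the key count grows to 20 million - so "the whole dataset in memory" means a couple of large-RAM machines today and a genuine cluster later.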
If I understand you correctly, and you only need to aggregate on single columns at a time, you can store your data differently for better results. In HBase that would look something like:
- a table per data column in today's setup, and another single table for the filtering fields (type_ids)
- a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the key for efficient filtering, otherwise you'd have to do a two-phase read
- a column for each table (dataset) in today's setup, i.e. a few thousand columns

HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist. When you read a row you get all the relevant values, over which you can compute avg etc. quite easily.
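A sketch of what reading that layout could look like from Python with the happybase client (the table name, column family, dataset names, and keys below are made up for illustration):

```python
# Hypothetical sketch of the layout above: one HBase table per numeric column,
# one row per key, one column per dataset, read through the happybase client.
from math import sqrt
import happybase

conn = happybase.Connection("localhost")          # HBase Thrift gateway
table = conn.table("val1")                        # the table for numeric column "val1"

# the 15-1000 datasets the user selected, as "family:qualifier" column names
selected = [("d:dataset_%04d" % i).encode() for i in range(1, 16)]

def stats_for_key(row_key):
    row = table.row(row_key, columns=selected)    # fetch only the chosen datasets
    values = [float(v) for v in row.values()]
    mean = sum(values) / len(values)
    return mean, sqrt(sum(x * x for x in values) / len(values) - mean * mean)

print(stats_for_key(b"key_000042"))
```

For all keys at once you would use table.scan() with the same columns argument instead of row-by-row gets.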
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably get away with just one or two large tables. SQL has problems when you need to join over large data, but since your dataset doesn't sound like it needs joins, you should be fine. You can have indexes set up to find the datasets, and then do the math either in SQL or in the app.
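For instance (table and column names are made up, and this assumes the postgres already in use), a single tall table holding every dataset lets the database do the per-key math itself:

```python
# Hypothetical sketch: one tall "measurements" table containing all datasets,
# indexed on (dataset_id, type_id); avg/stddev per key computed in SQL.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute(
    """
    SELECT key, avg(val1), stddev_pop(val1)
    FROM measurements
    WHERE dataset_id = ANY(%s)      -- the datasets the user selected
      AND type_id IN %s             -- the type_id filter
    GROUP BY key
    """,
    ([1, 2, 3, 7], (1, 3)),
)
for key, mean, sd in cur.fetchall():
    pass  # hand the aggregated rows to the web application
```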