Hadoop (+HBase/HDFS) vs. MySQL (or Postgres) - loads of independent, structured data to be processed and queried
Hi there at SO,
I would like some ideas/comments on the following from you, honorable and venerable bunch.
I have 100M records which I need to process. I have 5 nodes (in a Rocks cluster) to do this. The data is very structured and falls nicely into the relational data model. I want to do things in parallel since my processing takes some time.
As I see it I have two main options:
Install mysql on each node and put 20M records on each. Use the head node to delegate queries to the nodes and aggregate the results. Query capabilities++, but I might risk some headaches when I come to choose partitioning strategies etc. (Q: Is this what they call a mysql/postgres cluster?). The really bad part is that the processing of the records is now left up to me to take care of (how to distribute across machines etc.) - see the rough sketch at the end of this question.
Alternatively, install Hadoop, Hive and HBase (note that this might not be the most efficient way to store my data, since HBase is column-oriented) and just define the nodes. We write everything in the MapReduce paradigm and, bang, we live happily ever after. The problem here is that we lose the "real time" query capability (I know you can use Hive, but it is not suggested for real-time queries, which I need), since I also have some normal SQL queries to execute at times, e.g. "select * from wine where colour = 'brown'".
Note that in theory - if I had 100M machines I could do the whole thing instantly, since for each record the processing is independent of the others. Also - my data is read-only. I do not envisage any updates happening. I do not need/want 100M records on one node. I do not want there to be redundant data (since there is lots of it), so keeping it in BOTH mysql/postgres and Hadoop/HBase/HDFS is not a real option.
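For the first option, roughly what I have in mind is the hedged sketch below (host names, credentials and the wine table are placeholders; simply concatenating the per-node results only works for simple selects, not for global ORDER BY or GROUP BY):

    # Hedged sketch: the head node fans the same query out to per-node MySQL
    # shards and concatenates the rows. Hosts, credentials and the wine table
    # are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    import mysql.connector

    SHARDS = ["node1", "node2", "node3", "node4", "node5"]  # one MySQL per cluster node
    QUERY = "SELECT * FROM wine WHERE colour = 'brown'"

    def query_shard(host):
        # Each shard holds ~20M of the 100M rows; same schema everywhere.
        conn = mysql.connector.connect(host=host, user="reader",
                                       password="secret", database="winedb")
        try:
            cur = conn.cursor()
            cur.execute(QUERY)
            return cur.fetchall()
        finally:
            conn.close()

    # Run the query on all shards in parallel and merge the result lists.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        results = [row for rows in pool.map(query_shard, SHARDS) for row in rows]

    print(len(results), "matching rows across all shards")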
Many Thanks
Comments (4)
Can you prove that MySQL is the bottleneck? 100M records is not that many, and it looks like you're not performing complex queries. Without knowing exactly what kind of processing is involved, here is what I would do, in this order:
In your situation, I would resist the temptation to jump off of MySQL, unless it is absolutely necessary.
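As a quick sanity check, something like the hedged sketch below (the wine table and connection details are placeholders) will tell you whether the query plan or the query time is really the problem:

    # Hedged sketch: look at the plan and time the query before changing stacks.
    # Table, column and connection details are placeholders.
    import time
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="reader",
                                   password="secret", database="winedb")
    cur = conn.cursor()

    cur.execute("EXPLAIN SELECT * FROM wine WHERE colour = 'brown'")
    for row in cur.fetchall():
        print(row)  # a full table scan here often just means a missing index

    start = time.time()
    cur.execute("SELECT COUNT(*) FROM wine WHERE colour = 'brown'")
    print(cur.fetchone()[0], "rows in", time.time() - start, "seconds")

    # If the scan is the problem, an index is far cheaper than a new architecture:
    # cur.execute("CREATE INDEX idx_wine_colour ON wine (colour)")
    conn.close()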
There are a few questions to ask before suggesting anything.
Can you formulate your queries to access data by primary key only? In other words, can you avoid all joins and table scans? If so, HBase is an option, especially if you need a very high rate of read/write access.
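For illustration, key-only access from Python could look like the hedged sketch below (the happybase client, Thrift host, table name and row key are all assumptions, not something from your setup):

    # Hedged sketch: a pure get-by-rowkey against HBase via its Thrift gateway.
    # Host, table and row key are made-up placeholders.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")  # HBase Thrift server must be running
    wine_table = connection.table("wine")

    # One row fetched directly by its key - no scan, no join.
    row = wine_table.row(b"wine:00042")
    print(row)  # dict of {b'family:qualifier': b'value'}

    connection.close()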
I do not think that Hive is a good option, taking into consideration the low data volume. If you expect it to grow significantly, you can consider it. In any case, Hive is good for analytical workloads, not for OLTP-type processing.
If you do need a relational model with joins and scans, I think a good solution might be one master node and 4 slaves, with replication between them. You would direct all writes to the master and balance reads across the whole cluster. It is especially good if you have many more reads than writes.
In this scheme you will have all 100M records (not that much) on each node. Within each node you can employ partitioning if appropriate.
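A hedged sketch of how the read/write split could look from the application side (host names, credentials and the wine table are placeholders; a real setup would more likely use a proxy or driver-level load balancing):

    # Hedged sketch: writes go to the master, reads are spread over the replicas.
    # Host names, credentials and the wine table are placeholders.
    import random
    import mysql.connector

    MASTER = "db-master"
    REPLICAS = ["db-replica1", "db-replica2", "db-replica3", "db-replica4"]

    def connect(host):
        return mysql.connector.connect(host=host, user="app",
                                       password="secret", database="winedb")

    def read(sql):
        # A random replica is enough for a read-mostly (here: read-only) workload.
        conn = connect(random.choice(REPLICAS))
        try:
            cur = conn.cursor()
            cur.execute(sql)
            return cur.fetchall()
        finally:
            conn.close()

    def write(sql):
        conn = connect(MASTER)
        try:
            conn.cursor().execute(sql)
            conn.commit()
        finally:
            conn.close()

    rows = read("SELECT * FROM wine WHERE colour = 'brown'")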
You may also want to consider using Cassandra. I recently discovered this article on HBase vs. Cassandra which I was reminded of when I read your post.
The gist of it is that Cassandra is a highly scalable NoSQL solution with fast querying, which sort of sounds like the solution you're looking for.
So, it all depends on whether you need to maintain your relational model or not.
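If you go that way, a minimal sketch with the DataStax Python driver might look like this (contact point, keyspace, table and key are assumptions; note that a non-key predicate such as colour = 'brown' would need a secondary index or a table laid out for that query in Cassandra):

    # Hedged sketch: a lookup by partition key with the DataStax Python driver.
    # Contact point, keyspace, table and key are placeholders.
    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra-node1"])
    session = cluster.connect("winedb")

    # The fast path in Cassandra is fetching rows by their partition key.
    rows = session.execute("SELECT * FROM wine WHERE wine_id = %s", ("00042",))
    for row in rows:
        print(row)

    cluster.shutdown()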
Hi,
I had a situation where I had many tables which I created in parallel using sqlalchemy and the Python multiprocessing library. I had multiple files, one per table, and loaded them using parallel COPY processes. If each process corresponds to a separate table, that works well. With one table, using COPY would be difficult. You could use table partitioning in PostgreSQL, I guess. If you are interested I can give more details.
Regards.
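A minimal sketch of that pattern, assuming psycopg2 and one CSV file per table (the DSN, table names and file paths are placeholders):

    # Hedged sketch: one CSV file per table, each loaded by its own process via COPY.
    # The DSN, table names and file paths are placeholders.
    from multiprocessing import Pool
    import psycopg2

    DSN = "dbname=winedb user=loader password=secret host=localhost"
    JOBS = [("wine_red", "/data/wine_red.csv"),
            ("wine_white", "/data/wine_white.csv"),
            ("wine_rose", "/data/wine_rose.csv")]

    def load(job):
        table, path = job
        conn = psycopg2.connect(DSN)
        try:
            with conn, conn.cursor() as cur, open(path) as f:
                # COPY is the bulk-load fast path; one table per process avoids contention.
                cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", f)
        finally:
            conn.close()

    if __name__ == "__main__":
        with Pool(processes=len(JOBS)) as pool:
            pool.map(load, JOBS)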