Originally, I had to deal with just 1.5[TB] of data. Since I just needed fast writes/reads (without any SQL), I designed my own flat binary file format (implemented using Python) and easily (and happily) saved my data and manipulated it on one machine. Of course, for backup purposes, I added 2 machines to be used as exact mirrors (using rsync).
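For illustration, here is a minimal sketch of the kind of flat binary format I mean (the id/timestamp/value record layout is just an example, not my actual format):

    import struct

    # Example fixed-size record: int64 id, int64 timestamp, float64 value.
    RECORD = struct.Struct("<qqd")

    def append_record(path, rec_id, timestamp, value):
        # Writing is a single sequential append -- no SQL layer involved.
        with open(path, "ab") as f:
            f.write(RECORD.pack(rec_id, timestamp, value))

    def read_record(path, index):
        # Fixed-size records give O(1) random access via seek().
        with open(path, "rb") as f:
            f.seek(index * RECORD.size)
            return RECORD.unpack(f.read(RECORD.size))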
Presently, my needs are growing, and I need to build a solution that will successfully scale up to 20[TB] (and even more) of data. I would be happy to continue using my flat file format for storage. It is fast, reliable, and gives me everything I need.
The thing I am concerned about is replication, data consistency, etc. across the network (as obviously, the data will have to be distributed -- not all of it can be stored on one machine).
Are there any ready-made solutions (Linux / Python based) that would allow me to keep using my file format for storage, yet would handle the other components that NoSQL solutions normally provide (data consistency / availability / easy replication)?
Basically, all I want to make sure is that my binary files stay consistent throughout my network. I am using a network of 60 Core Duo machines (each with 1GB RAM and 1.5TB of disk).
Approach: Distributed MapReduce in Python with the Disco Project
This seems like a good way of approaching your problem. I have used the Disco Project on similar problems.
You can distribute your files among n machines (processes) and implement map and reduce functions that fit your logic.
The Disco Project tutorial describes exactly how to implement a solution for problems like yours. You'll be impressed by how little code you need to write, and you can definitely keep the format of your binary files.
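For a flavour of how little code this takes, here is a sketch along the lines of the Disco tutorial's word-count example (the input URL and the counting logic come from that tutorial, not from your binary format; you would swap in map/reduce functions that parse your own records):

    from disco.core import Job, result_iterator

    def map(line, params):
        # Emit a (key, value) pair per word; replace this with logic
        # that parses your own binary records.
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        # Group values by key and aggregate them.
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                        map=map,
                        reduce=reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print(word, count)

Disco distributes the input across the nodes in your cluster and handles scheduling and fault tolerance for you.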
Another similar option is to use Amazon's Elastic MapReduce.
Perhaps some of the commentary on the Kivaloo system developed for Tarsnap will help you decide what's most appropriate: http://www.daemonology.net/blog/2011-03-28-kivaloo-data-store.html
Without knowing more about your application (size/type of records, frequency of reads/writes) or your custom format, it's hard to say more.