Hadoop、硬件和生物信息学
我们即将购买新硬件来运行我们的分析,并想知道我们是否做出了正确的决定。
设置:
我们是一个生物信息学实验室,将处理 DNA 测序数据。我们这个领域面临的最大问题是数据量,而不是计算量。单个实验很快就会达到 10-100 Gb,我们通常会同时运行不同的实验。显然,mapreduce 方法很有趣(另请参见 http://abhishek-tiwari.com/2010/08/mapreduce-and-hadoop-algorithms-in-bioinformatics-papers.html),但并非我们所有的软件都使用该范例。此外,某些软件使用 ascii 文件作为输入/输出,而其他软件则使用二进制文件。
我们可能会购买什么:
我们可能购买的机器是具有 32 个内核和 192Gb RAM 的服务器,连接到 NAS 存储(>20Tb)。对于我们的许多(非 MapReduce)应用程序来说,这似乎是一个非常有趣的设置,但是这样的配置会阻止我们以有意义的方式实现 hadoop/mapreduce/hdfs 吗?
非常感谢,
一月
We're about to buy new hardware to run our analyses and are wondering if we're making the right decisions.
The setting:
We're a bioinformatics lab that will be handling DNA sequencing data. The biggest issue that our field has is the amount of data, rather than the compute. A single experiment will quickly go into the 10s-100s of Gb, and we would typically run different experiments at the same time. Obviously, mapreduce approaches are interesting (see also http://abhishek-tiwari.com/2010/08/mapreduce-and-hadoop-algorithms-in-bioinformatics-papers.html), but not all our software use that paradigm. Also, some software uses ascii files as in/output while other software works with binary files.
What we might be buying:
The machine that we might be buying would be a server with 32 cores and 192Gb of RAM, linked to NAS storage (>20Tb). This seems a very interesting setup for us for many of our (non-mapreduce) applications, but will such configuration prevent us from implementing hadoop/mapreduce/hdfs in a meaningful way?
Many thanks,
jan.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
你有一个有趣的配置。您使用的 NAS 存储的磁盘 IO 是多少?
根据以下因素做出决定:
MapReduce范式用于解决处理大量数据的问题。基本上,RAM 比磁盘存储更昂贵。您无法将所有数据保存在 RAM 中。磁盘存储允许您以更便宜的成本存储大量数据。但是,从磁盘读取数据的速度不是很高。 MapReduce是如何解决这个问题的呢? MapReduce通过将数据分布在多台机器上来解决这个问题。现在,并行读取数据的速度比使用单个存储磁盘的速度要快。假设磁盘 IO 速度为 100 Mbps。使用 100 台机器,您可以以 100*100 Mbps = 10Gbps 的速度读取数据。
通常处理器速度不是瓶颈。相反,磁盘 IO 是处理大量数据时的大瓶颈。
我有一种感觉,效率可能不太高。
You have an interesting configuration. What would be the Disk IO for the NAS storage used by you?
Make your decision based on the following:
Map Reduce paradigm is used to solve the problem of handling large amount of data. Basically, RAM is more expensive than the Disk storage. You cannot hold all the data in the RAM. Disk storage allows you to store large amounts of data at cheaper costs. But, the speed at which you can read data from the disks is not very high. How does Map Reduce solve this problem? Map Reduce solves this problem by distributing the data over multiple machines. Now, the speed at which you can read data in parallel is greater than you could have done with a single storage disk. Suppose the Disk IO speed is 100 Mbps. With 100 machines you can read the data at 100*100 Mbps = 10Gbps.
Typically processor speed is not the bottleneck. Rather, the Disk IOs are the big bottlenecks while processing large amount of data.
I have a feeling that it may not be very efficient.