I want to do log parsing of huge amounts of data and gather analytic information. However, all the data comes from external sources and I have only 2 machines to store it on - one as backup/replication.
I'm trying to use Hadoop, Lucene... to accomplish that. But all the training docs mention that Hadoop is useful for distributed, multi-node processing. My setup does not fit that architecture.
Are there any overheads to using Hadoop with just 2 machines? If Hadoop is not a good choice, are there alternatives? We looked at Splunk and we like it, but it is too expensive for us to buy. We just want to build our own.
Hadoop should be used for distributed batch processing problems.
5-common-questions-about-hadoop
Analysis of log files is one of the more common uses of Hadoop; it's one of the tasks Facebook uses it for.
If you have two machines, then by definition you have a multi-node cluster. You can run Hadoop on a single machine if you want, but as you add more nodes, the time it takes to process the same amount of data decreases.
You say you have huge amounts of data? Those are important numbers to pin down. Personally, when I think of data as huge, I think in the hundreds-of-terabytes-and-up range. If that is the case, you'll probably need more than two machines, especially if you want to use replication over HDFS.
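As a rough illustration of the replication point: HDFS defaults to a replication factor of 3, so a two-node cluster would need that lowered in `hdfs-site.xml`. The property name `dfs.replication` is standard Hadoop configuration; the value here is just what a two-machine setup would imply, not a recommendation:

```xml
<!-- hdfs-site.xml: lower the default replication factor (3)
     to match a two-DataNode cluster -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```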
What about the analytic information you want to gather? Have you determined that those questions can be answered using the MapReduce approach?
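One quick way to check is to sketch the question as a map step and a reduce step. Here is a minimal illustration of the MapReduce pattern in plain Python (not Hadoop itself); the log format and the status-code question are made-up examples, not anything from the original post:

```python
from collections import defaultdict

# Hypothetical access-log lines; the "METHOD PATH STATUS" format is assumed.
LOG_LINES = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
]

def map_phase(line):
    """Map step: emit a (status_code, 1) pair for each log line."""
    status = line.rsplit(" ", 1)[-1]
    yield (status, 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each key, like a Hadoop reducer."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

pairs = [kv for line in LOG_LINES for kv in map_phase(line)]
counts = reduce_phase(pairs)
print(counts)  # counts per HTTP status code, e.g. {'200': 3, '404': 1}
```

If your analytics can be phrased this way (independent per-record extraction, then aggregation by key), they are a good fit for MapReduce; questions that need global ordering or cross-record joins are harder to express.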
Something you could consider, if you have a limited amount of hardware, is running Hadoop on Amazon's EC2. Here are some links to get you started: