Using Hadoop to process real-time logs from a web server
I want to process the logs from my web server as they come in, using Hadoop (Amazon Elastic MapReduce). I googled for help but found nothing useful. I would like to know whether this can be done, or whether there is an alternative way to do it.
4 Answers
Hadoop is usually used in an offline manner, so I would rather process the logs periodically.

In a project I was involved with previously, we made our servers produce log files that were rotated hourly (every hour at x:00). We had a script that ran hourly (every hour at x:30) and uploaded into HDFS those files that weren't already there; a minimal sketch of such a script is below. You can then run jobs in Hadoop as often as you like to process these files.

I am sure there are better real-time alternatives too.
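A minimal sketch of that hourly upload step, assuming rotated logs land in a hypothetical /var/log/httpd/rotated directory and that the hadoop command-line client is on the PATH:

```python
#!/usr/bin/env python
"""Hypothetical hourly cron job: upload newly rotated log files into HDFS.

The paths and file-naming scheme are assumptions for illustration;
adapt them to your own setup.
"""
import glob
import os
import subprocess

LOCAL_LOG_DIR = "/var/log/httpd/rotated"   # assumed location of rotated logs
HDFS_LOG_DIR = "/logs/webserver"           # assumed HDFS target directory

def hdfs_exists(path):
    # `hadoop fs -test -e` exits with 0 if the path already exists
    return subprocess.call(["hadoop", "fs", "-test", "-e", path]) == 0

def main():
    for local_path in sorted(glob.glob(os.path.join(LOCAL_LOG_DIR, "access.log.*"))):
        hdfs_path = os.path.join(HDFS_LOG_DIR, os.path.basename(local_path))
        if hdfs_exists(hdfs_path):
            continue  # skip files that a previous run already uploaded
        subprocess.check_call(["hadoop", "fs", "-put", local_path, hdfs_path])

if __name__ == "__main__":
    main()
```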
Hadoop is not used for live, real-time processing. But it can be used to process logs on an hourly basis, perhaps one hour behind, which is near real time. I wonder what the need is to process logs as they come in.
Something you can try is to use Flume as a log collector and store the logs in S3 for batch processing:
http://www.cloudera.com/blog/2011/02/distributed-flume-setup-with-an-s3-sink/
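Once the logs are sitting in S3, a plain Hadoop Streaming job can batch-process them on Elastic MapReduce. Below is a minimal sketch in Python that counts hits per URL; the common-log-format parsing and the logcount.py file name are assumptions for illustration:

```python
#!/usr/bin/env python
"""Minimal Hadoop Streaming job over web-server access logs: counts hits per URL.

The same file serves as both mapper and reducer, e.g.:
  -mapper "logcount.py map" -reducer "logcount.py reduce"
"""
import sys

def mapper():
    for line in sys.stdin:
        parts = line.split()
        # In common log format the request path is the 7th field:
        # host ident user [time tz] "METHOD /path HTTP/1.x" status bytes
        if len(parts) > 6:
            print("%s\t1" % parts[6])

def reducer():
    # Streaming guarantees reducer input is sorted by key, so we can
    # accumulate a running count and emit on each key change.
    current_url, count = None, 0
    for line in sys.stdin:
        url, _, value = line.rstrip("\n").partition("\t")
        if url != current_url:
            if current_url is not None:
                print("%s\t%d" % (current_url, count))
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```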
If you want true real-time processing, you might want to look at Twitter's Storm, which is open source and hosted on GitHub. Tutorial here.

It looks like it is being used in production at large companies.

On that note, I don't use Storm myself, and actually do something similar to what has been mentioned in the question and responses:

With Hadoop, you can get close to real time by running the batch processing often on a cluster and just adding a new job each time, but not true real time. For that you need Storm or something similar.
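To make "run the batch processing often and just add a new job" concrete, here is a hypothetical driver loop that submits one Hadoop Streaming job per hourly log partition. The streaming jar path and the s3://my-log-bucket paths are assumptions, and it reuses the hypothetical logcount.py script sketched above:

```python
#!/usr/bin/env python
"""Hypothetical near-real-time loop: every hour, submit one Hadoop Streaming
job over just the latest hour of logs. Jar path and bucket are assumptions.
"""
import subprocess
import time

STREAMING_JAR = "/home/hadoop/contrib/streaming/hadoop-streaming.jar"  # assumed path

def process_hour(hour_stamp):
    # One job per hourly partition; output goes to a matching directory.
    # Depending on your Hadoop version the scheme may be s3n:// instead of s3://.
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input",  "s3://my-log-bucket/logs/%s/" % hour_stamp,   # hypothetical bucket
        "-output", "s3://my-log-bucket/counts/%s/" % hour_stamp,
        "-mapper",  "logcount.py map",
        "-reducer", "logcount.py reduce",
        "-file",    "logcount.py",
    ])

if __name__ == "__main__":
    while True:
        # Process the hour that just finished, then wait for the next one.
        last_hour = time.strftime("%Y-%m-%d-%H", time.localtime(time.time() - 3600))
        process_hour(last_hour)
        time.sleep(3600)
```

Because each hour gets its own input and output directory, re-running a failed hour is just a matter of resubmitting that one job.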