Using Hadoop to process real-time logs from a web server

Posted 2024-08-24 00:47:40

I want to process the logs from my web server as they come in, using Hadoop (Amazon Elastic MapReduce). I googled for help but found nothing useful. I would like to know whether this can be done, or whether there is an alternative way to do it.


Comments (4)

莫多 2024-08-31 00:47:40

Hadoop is usually used in an offline manner, so I would rather process the logs periodically.

In a project I was involved with previously, we had the servers produce log files that rotated hourly (every hour at x:00). A script that ran hourly (every hour at x:30) uploaded into HDFS any files that weren't already there. You can then run jobs in Hadoop as often as you like to process those files; a sketch of such an upload script follows.

I am sure there are better real-time alternatives too.
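As a minimal sketch of that hourly upload job (not the original poster's script): the log directory, the access.*.log naming pattern, and the HDFS target path below are all assumptions.

```python
#!/usr/bin/env python3
"""Hourly cron job that ships rotated web-server logs into HDFS.

Sketch only: LOCAL_LOG_DIR, the file naming pattern, and
HDFS_LOG_DIR are placeholders, not the poster's actual setup.
"""
import glob
import os
import subprocess

LOCAL_LOG_DIR = "/var/log/httpd"   # assumed location of rotated logs
HDFS_LOG_DIR = "/data/weblogs"     # assumed HDFS target directory


def hdfs_exists(path: str) -> bool:
    """Return True if the path already exists in HDFS."""
    return subprocess.run(
        ["hdfs", "dfs", "-test", "-e", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0


def upload_new_logs() -> None:
    """Upload every rotated log file that is not yet in HDFS."""
    for local in sorted(glob.glob(os.path.join(LOCAL_LOG_DIR, "access.*.log"))):
        remote = f"{HDFS_LOG_DIR}/{os.path.basename(local)}"
        if hdfs_exists(remote):
            continue  # a previous run already shipped this file
        subprocess.run(["hdfs", "dfs", "-put", local, remote], check=True)


if __name__ == "__main__":
    upload_new_logs()
```

Running it from cron at x:30 leaves half an hour of margin after the x:00 rotation, so the previous hour's file is guaranteed to be closed before it is shipped.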

帝王念 2024-08-31 00:47:40

Hadoop is not used for live, real-time processing. But it can be used to process logs on an hourly basis, perhaps running one hour behind, which is near real time. I wonder what the need is to process the logs as they come in.

才能让你更想念 2024-08-31 00:47:40

Something you can try is to use Flume as a log collector and store the logs in S3 for batch processing:

http://www.cloudera.com/blog/2011/02/distributed-flume-setup-with-an-s3-sink/
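The linked 2011 post predates Flume NG, but as a rough sketch of the same idea in the newer properties format, an agent can tail the access log, buffer events in memory, and roll hourly files to S3 through the HDFS sink (this requires the Hadoop S3 filesystem jars on the classpath). The bucket name, paths, and sizing below are all placeholders:

```
# Sketch of a Flume NG agent config: tail the access log and write
# hourly files to S3 via the HDFS sink. Names/paths are placeholders.
agent.sources  = tail
agent.channels = mem
agent.sinks    = s3

agent.sources.tail.type = exec
agent.sources.tail.command = tail -F /var/log/httpd/access_log
agent.sources.tail.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.s3.type = hdfs
agent.sinks.s3.channel = mem
agent.sinks.s3.hdfs.path = s3n://my-log-bucket/weblogs/%Y-%m-%d
agent.sinks.s3.hdfs.fileType = DataStream
agent.sinks.s3.hdfs.rollInterval = 3600
agent.sinks.s3.hdfs.useLocalTimeStamp = true
```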

相思碎 2024-08-31 00:47:40

If you want true real-time processing, you might want to look at Twitter's Storm, which is open source and hosted on GitHub. Tutorial here.

It looks like it is being used in production at large companies.

That said, I don't use Storm myself, and I actually do something similar to what has been mentioned in the question and the other responses:

  1. Log events using Apache (using rotatelogs to switch log files every 15/30 minutes)
  2. Upload them to S3 every so often
  3. Add a new step to an existing Hadoop cluster (on Amazon EMR); a sketch of this step follows below
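For step 3, here is a minimal sketch of submitting a step to a running EMR cluster with boto3. The original answer does not say how the step was added, so the cluster ID, bucket names, and mapper/reducer scripts below are placeholders, and this is one possible shape rather than the poster's actual setup.

```python
"""Add a log-processing step to a running EMR cluster.

Sketch only: the cluster ID, region, S3 paths, and streaming
scripts are placeholders, not the poster's actual setup.
"""
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[{
        "Name": "process-weblogs",
        "ActionOnFailure": "CONTINUE",  # keep the cluster alive on failure
        "HadoopJarStep": {
            # command-runner.jar runs hadoop-streaming on EMR, which lets
            # the mapper/reducer be plain scripts shipped via -files.
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", ("s3://my-log-bucket/scripts/mapper.py,"
                           "s3://my-log-bucket/scripts/reducer.py"),
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://my-log-bucket/weblogs/",
                "-output", "s3://my-log-bucket/output/run-001/",
            ],
        },
    }],
)
```

Scheduling something like this right after each S3 upload, with -input pointed at the newly uploaded files, gives the 15/30-minute lag described above.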

With Hadoop, you can get close to real time by running the batch processing on the cluster frequently and just adding a new job each time, but it is not true real time. For that you need Storm or something similar.
